ABSTRACT

The SPARQL query language is a recent W3C standard for 
processing RDF data, a format that has been developed to 
encode information in a machine-readable way. We inves- 
tigate the foundations of SPARQL query optimization and 
(a) provide novel complexity results for the SPARQL evalu- 
ation problem, showing that the main source of complexity 
is operator Optional alone; (b) propose a comprehensive 
set of algebraic query rewriting rules; (c) present a frame- 
work for constraint-based SPARQL optimization based upon 
the well-known chase procedure for Conjunctive Query min- 
imization. In this line, we develop two novel termination 
conditions for the chase. They subsume the strongest condi- 
tions known so far and do not increase the complexity of the 
recognition problem, thus making a larger class of both Con- 
junctive and SPARQL queries amenable to constraint-based 
optimization. Our results are of immediate practical interest 
and might empower any SPARQL query optimizer. 

1. Introduction 

The SPARQL Protocol and Query Language is a recent 
W3C recommendation that has been developed to extract 
information from data encoded using the Resource Descrip-
tion Framework (RDF) [14]. From a technical point of view,
RDF databases are collections of (subject,predicate,object) 
triples. Each triple encodes the binary relation predicate be- 
tween subject and object, i.e. represents a single knowledge 
fact. Due to their homogeneous structure, RDF databases can 
be seen as labeled directed graphs, where each triple defines 
an edge from the subject to the object node under label pred- 
icate. While originally designed to encode knowledge in the 
Semantic Web in a machine-readable format, RDF has found 
its way out of the Semantic Web community and entered the 
wider discourse of Computer Science. Coming along with 
its application in other areas, such as bioinformatics, data
publishing, or data integration, large RDF repositories have 
been created (cf. Ii29l ). It has repeatedly been observed that 
the database community is facing new challenges to cope 
with the specifics of the RDF data format ITTI [TSlimiSlI . 
With SPARQL, the W3C has recommended a declarative
query language for extracting data from RDF graphs.
SPARQL comes with a powerful graph matching facility,
whose basic constructs are so-called triple patterns. During
query evaluation, variables inside these patterns are matched 
against the RDF input graph. The solution of the evaluation 
process is then described by a set of mappings, where each 
mapping associates a set of variables with RDF graph com- 
ponents. SPARQL additionally provides a set of operators 
(namely And, Filter, Optional, Select, and Union), 
which can be used to compose more expressive queries. 

One key contribution in this paper is a comprehensive com-
plexity analysis for fragments of SPARQL. We follow pre-
vious approaches [26] and use the complexity of the Eval-
uation problem as a yardstick: given query Q, data set D,
and candidate solution S as input, check if S is contained in
the result of evaluating Q on D. In [26] it has been shown
that full SPARQL is PSPACE-complete, which is bad news
from a complexity point of view. We show that already op-
erator Optional alone makes the Evaluation problem
PSPACE-hard. Motivated by this result, we further refine our
analysis and prove better complexity bounds for fragments 
with restricted nesting depth of Optional expressions. 

Having established this theoretical background, we turn 
towards SPARQL query optimization. The semantics of 
SPARQL is formally defined on top of a compact algebra 
over mapping sets. In the evaluation process, the SPARQL 
operators are first translated into algebraic operations, which 
are then directly evaluated on the data set. The SPARQL 
Algebra (SA) comprises operations such as join, union, left 
outer join, difference, projection, and selection, akin to the 
operators defined in Relational Algebra (RA). At first glance, 
there are many parallels between SA and RA; in fact, the 
study in [1] reveals that SA and RA have exactly the same
expressive power. However, the technically involved proof
in [1] indicates that a semantics-preserving SA-to-RA trans-
lation is far from trivial (cf. [6]). Hence, although both
algebras provide similar operators, there are still very funda-
mental differences between the two. One of the most striking
discrepancies, as also argued in [26], is that joins in RA are
rejecting over null-values, but in SA, where the schema is 
loose in the sense that mappings may bind an arbitrary set 
of variables, joins over unbound variables (essentially the 
equivalent of RA null-values) are always accepting. 



One direct implication is that not all equivalences that hold
in RA also hold in SA, and vice versa, which calls for a study
of SA in its own right. In response, we present an elaborate
study of SA in the second part of the paper. We survey exist-
ing and develop new algebraic equivalences, covering vari-
ous SA operators, their interaction, and their relation to the
RA counterparts. When interpreted as rewriting rules, these 
equivalences form the theoretical foundations for transferring 
established RA optimization techniques, such as filter push-
ing, into the SPARQL context. Going beyond the adaptation
of existing techniques, we also address SPARQL-specific is-
sues, e.g. we provide rules for simplifying expressions involving
(closed-world) negation, which can be expressed in SPARQL 
syntax using a combination of Optional and Filter. 

We note that in the past much research effort has been 
spent in processing RDF data with traditional systems, such 
as relational DBMSs or datalog engines ir7l [T8ll31ll25lll2ll21l 
[27 1, thus falling back on established optimization strategies. 
Some of them (e.g. |3F 12|) work well in practice, but are 
limited to small fragments, such as AND-only queries. More 
complete approaches (e.g. UJ) suffer from performance bot- 
tlenecks for complex queries, often caused by poor optimiza- 
tion results (cf. ifTSl 1211 1221). For instance, |21| identifies 
deficiencies of existing schemes for queries involving nega- 
tion, a problem that we tackle in our analysis. This also 
shows that traditional approaches are not laid out for the spe- 
cific challenges that come along with SPARQL processing 
and urges the need for a thorough investigation of SA. 

In the final part of the paper we study constraint-based 
query optimization in the context of SPARQL, also known 
as Semantic Query Optimization (SQO). SQO has been ap-
plied successfully in other contexts before, such as Conjunc-
tive Query (CQ) optimization (e.g., [3]), relational databases
(e.g., [17]), and deductive databases (e.g., [5]). We demon-
strated the prospects of SQO for SPARQL in [19], and in
this work we lay the foundations for a schematic semantic
optimization approach. Our SQO scheme builds upon the
Chase & Backchase (C&B) algorithm [9], an extension of
the well-known chase procedure for CQ optimization [24, 3,
16]. One key problem with the chase is that it might not al-
ways terminate. Even worse, it has recently been shown that
for an arbitrary set of constraints it is undecidable if it termi-
nates or not [8]. There exist, however, sufficient conditions
for the termination of the chase; the best condition known so
far is that of stratified constraints [8]. The definition of strat-
ification uses a former termination condition for the chase,
namely weak acyclicity [28]. In this paper, we present two
provably stronger termination conditions, making a larger
class of CQs amenable to the semantic optimization process
and generalizing the methods introduced in [28, 8].

Our first condition, called safety, strictly subsumes weak 
acyclicity and the second, safe restriction, strictly subsumes 
stratification. They do not increase the complexity of the 
recognition problem, i.e. safety is checkable in polynomial 
time (like weak acyclicity) and safe restriction by a coNP-
algorithm (like stratification). We emphasize that our results
immediately carry over to data exchange [28] and integra-
tion [20], query answering using views [15], and the implica-
tion problem for constraints. Further, they apply to the core
chase introduced in [8] (there it was proven that the termina-
tion of the chase implies termination of the core chase).

In order to optimize SPARQL queries, we translate And- 
blocks of the query into CQs, optimize them using the C&B- 
algorithm, and translate the outcome back into SPARQL. Ad- 
ditionally, we provide optimization rules that go beyond such 
simple queries, showing that in some cases Optional- and 
Filter-queries can be simplified. With respect to chase ter-
mination, we introduce two alternate SPARQL-to-CQ trans-
lation schemes. They differ w.r.t. the termination conditions
that they exhibit for the subsequent chase, i.e. our sufficient 
chase termination conditions might guarantee termination for 
the first but not for the second translation, and vice versa. 

Our key contributions can be summarized as follows. 

• We present previously unknown complexity results for 
fragments of the SPARQL query language, showing that 
the main source of complexity is operator Optional 
alone. Moreover, we prove there are better bounds when 
restricting the nesting depth of Optional expressions. 

• We summarize existent and establish new equivalences 
over SPARQL Algebra. Our extensive study character- 
izes the algebraic operators and their interaction, and 
might empower any SPARQL query optimizer. We also 
indicate an erratum in [26] and discuss its implications.

• Our novel SQO scheme for SPARQL can be used to 
optimize AND-only queries under a set of constraints. 
Further, we provide rules for semantic optimization of 
queries involving operators Optional and Filter. 

• We present two novel sufficient termination conditions 
for the chase, which strictly generalize previous condi-
tions. This improvement benefits the practicability of
many important research areas, e.g. [28, 20, 15, 9].

Structure. We start with some preliminaries in Section 2
and present the complexity results for SPARQL fragments in
Section 3. The subsequent discussion of query optimization
divides into algebraic optimization (Section 4) and seman-
tic optimization (Section 5). The latter discussion is com-
plemented by the chase termination conditions presented in
Section 6. Finally, Section 7 contains some closing remarks.

2. Preliminaries 

RDF. We follow the notation from [26]. We consider 
three disjoint sets $B$ (blank nodes), $I$ (IRIs), and $L$ (literals)
and use the shortcut $BIL$ to denote the union of the sets
$B$, $I$, and $L$. By convention, we indicate literals by quoted
strings (e.g. "Joe", "30") and prefix blank nodes with "_:". An
RDF triple $(v_1, v_2, v_3) \in BI \times I \times BIL$ connects subject
$v_1$ through predicate $v_2$ to object $v_3$. An RDF database,
also called document, is a finite set of triples. We refer the
interested reader to [14] for an elaborate discussion of RDF.

SPARQL Syntax. Let $V$ be an infinite set of variables
disjoint from $BIL$. We start with an abstract syntax for
SPARQL, where we abbreviate operator Optional as Opt.

Definition 1. We define SPARQL expressions recursively
as follows. (1) A triple pattern $t \in BIV \times IV \times BILV$
is an expression. (2) Let $Q_1$, $Q_2$ be expressions and $R$
a filter condition. Then $Q_1$ Filter $R$, $Q_1$ Union $Q_2$,
$Q_1$ Opt $Q_2$, and $Q_1$ And $Q_2$ are expressions. □

In the remainder of the paper we restrict our discussion 
to safe filter expressions Q Filter R, where the variables 
occurring in R form a subset of the variables in Q. As shown 
in UJ, this restriction does not compromise expressiveness. 

Next, we define SPARQL queries on top of expressions.^1

Definition 2. Let $Q$ be a SPARQL expression and let
$S \subset V$ be a finite set of variables. A SPARQL query is an
expression of the form $\mathrm{Select}_S(Q)$. □



SPARQL Semantics. A mapping is a partial function
$\mu : V \to BIL$ from a subset of variables $V$ to RDF terms $BIL$.
The domain of a mapping $\mu$, written $dom(\mu)$, is defined as the
subset of $V$ for which $\mu$ is defined. As a naming convention,
we distinguish variables from elements in $BIL$ through a
leading question mark symbol. Given two mappings $\mu_1$, $\mu_2$,
we say $\mu_1$ is compatible with $\mu_2$ if $\mu_1(?x) = \mu_2(?x)$ for
all $?x \in dom(\mu_1) \cap dom(\mu_2)$. We write $\mu_1 \sim \mu_2$ if $\mu_1$
and $\mu_2$ are compatible, and $\mu_1 \not\sim \mu_2$ otherwise. Further, we
write $vars(t)$ to denote all variables in triple pattern $t$ and by
$\mu(t)$ we denote the triple pattern obtained when replacing all
variables $?x \in dom(\mu) \cap vars(t)$ in $t$ by $\mu(?x)$.

Given variables $?x$, $?y$ and constants $c$, $d$, a filter condition
$R$ is either an atomic filter condition of the form $bound(?x)$
(abbreviated as $bnd(?x)$), $?x = c$, $?x = ?y$, or a combination
of atomic conditions using connectives $\neg$, $\wedge$, $\vee$. Condition
$bnd(?x)$ applied to a mapping set $\Omega$ returns all mappings in
$\Omega$ for which $?x$ is bound, i.e. $\{\mu \in \Omega \mid ?x \in dom(\mu)\}$. The
conditions $?x = c$ and $?x = ?y$ are equality checks, com-
paring the values of $?x$ with $c$ and $?y$, respectively. These
checks fail whenever one of the variables is not bound. We
write $\mu \models R$ if mapping $\mu$ satisfies filter condition $R$ (see
Definition 16 in Appendix B.2 for a formal definition). The
semantics of SPARQL is then formally defined using a com-
pact algebra over mapping sets (cf. [26]). The definition of
the algebraic operators join $\bowtie$, union $\cup$, set minus $\setminus$, left
outer join $\leftouterjoin$, projection $\pi$, and selection $\sigma$ is given below.

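Definition 3. Let $\Omega$, $\Omega_l$, $\Omega_r$ denote mapping sets, $R$ a
filter condition, and $S \subset V$ a finite set of variables. We
define the algebraic operations $\bowtie$, $\cup$, $\setminus$, $\leftouterjoin$, $\pi$, and $\sigma$:

$\Omega_l \bowtie \Omega_r := \{\mu_l \cup \mu_r \mid \mu_l \in \Omega_l, \mu_r \in \Omega_r : \mu_l \sim \mu_r\}$
$\Omega_l \cup \Omega_r := \{\mu \mid \mu \in \Omega_l \text{ or } \mu \in \Omega_r\}$
$\Omega_l \setminus \Omega_r := \{\mu_l \in \Omega_l \mid \text{for all } \mu_r \in \Omega_r : \mu_l \not\sim \mu_r\}$
$\Omega_l \leftouterjoin \Omega_r := (\Omega_l \bowtie \Omega_r) \cup (\Omega_l \setminus \Omega_r)$
$\pi_S(\Omega) := \{\mu_1 \mid \exists \mu_2 : \mu_1 \cup \mu_2 \in \Omega \wedge dom(\mu_1) \subseteq S \wedge dom(\mu_2) \cap S = \emptyset\}$
$\sigma_R(\Omega) := \{\mu \in \Omega \mid \mu \models R\}$ □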
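
The operators of Definition 3 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: mappings are modelled as frozensets of (variable, value) pairs, mapping sets as Python sets, and all function names (`mapping`, `compatible`, `join`, and so on) are hypothetical helpers introduced here, not part of any SPARQL library.

```python
from itertools import product

def mapping(**bindings):
    """Build a mapping; the keyword x stands for variable ?x (helper of this sketch)."""
    return frozenset(("?" + var, val) for var, val in bindings.items())

def compatible(m1, m2):
    """Two mappings are compatible if they agree on all shared variables."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())

def join(left, right):
    """Merge every compatible pair of mappings."""
    return {m1 | m2 for m1, m2 in product(left, right) if compatible(m1, m2)}

def union(left, right):
    return left | right

def minus(left, right):
    """Keep left mappings that are incompatible with every right mapping."""
    return {m1 for m1 in left if all(not compatible(m1, m2) for m2 in right)}

def left_outer_join(left, right):
    """Defined as (left join right) union (left minus right)."""
    return join(left, right) | minus(left, right)

def project(variables, omega):
    """Restrict every mapping to the given set of variables."""
    return {frozenset((v, x) for v, x in m if v in variables) for m in omega}

def select(predicate, omega):
    """Keep mappings satisfying a filter, given as a Python predicate on a dict."""
    return {m for m in omega if predicate(dict(m))}

left = {mapping(x=1), mapping(x=2, y=3)}
right = {mapping(y=3, z=4)}
# mapping(x=1) does not bind ?y, yet it joins with the right-hand mapping:
# joins over unbound variables are accepting.
print(join(left, right))
```

Representing mappings as immutable frozensets keeps mapping sets duplicate-free, which matches the set-based semantics followed in this paper.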



We follow the compositional, set-based semantics pro-
posed in [26] and define the result of evaluating SPARQL
query $Q$ on document $D$ using operator $[\![\cdot]\!]_D$ defined below.

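Definition 4. Let $t$ be a triple pattern, $Q_1$, $Q_2$ SPARQL
expressions, $R$ a filter condition, and $S \subset V$ a finite set
of variables. The semantics of SPARQL evaluation over
document $D$ is defined as follows.

$[\![t]\!]_D := \{\mu \mid dom(\mu) = vars(t) \text{ and } \mu(t) \in D\}$
$[\![Q_1 \text{ And } Q_2]\!]_D := [\![Q_1]\!]_D \bowtie [\![Q_2]\!]_D$
$[\![Q_1 \text{ Opt } Q_2]\!]_D := [\![Q_1]\!]_D \leftouterjoin [\![Q_2]\!]_D$
$[\![Q_1 \text{ Union } Q_2]\!]_D := [\![Q_1]\!]_D \cup [\![Q_2]\!]_D$
$[\![Q_1 \text{ Filter } R]\!]_D := \sigma_R([\![Q_1]\!]_D)$
$[\![\mathrm{Select}_S(Q_1)]\!]_D := \pi_S([\![Q_1]\!]_D)$ □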



Finally, we extend the definition of function vars. Let Q 
be a SPARQL expression, A a SPARQL Algebra expression, 
and $R$ a filter condition. By $vars(A)$, $vars(Q)$, and $vars(R)$
we denote the set of variables in $A$, $Q$, and $R$, respectively.
Further, we define function $safeVars(A)$, which denotes the
subset of variables in $vars(A)$ that are inevitably bound when
evaluating $A$ on any document $D$.

Definition 5. Let $A$ be a SPARQL Algebra expression,
$S \subset V$ a finite set of variables, and $R$ a filter con-
dition. We define function $safeVars(A)$ recursively on
the structure of expression $A$ as follows.

$safeVars([\![t]\!]_D) := vars(t)$
$safeVars(A_1 \bowtie A_2) := safeVars(A_1) \cup safeVars(A_2)$
$safeVars(A_1 \cup A_2) := safeVars(A_1) \cap safeVars(A_2)$
$safeVars(A_1 \setminus A_2) := safeVars(A_1)$
$safeVars(A_1 \leftouterjoin A_2) := safeVars(A_1)$
$safeVars(\pi_S(A_1)) := safeVars(A_1) \cap S$
$safeVars(\sigma_R(A_1)) := safeVars(A_1)$ □
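
The recursion of Definition 5 translates directly into code. The following Python sketch assumes a hypothetical encoding of algebra expressions as nested tuples such as `("join", A1, A2)`; the tags and the function names are illustrative choices made here and are not taken from the paper.

```python
def triple_vars(triple):
    """Variables of a triple pattern: terms with a leading question mark."""
    return {term for term in triple if isinstance(term, str) and term.startswith("?")}

def safe_vars(expr):
    """Variables that are bound in every result mapping (sketch of Definition 5)."""
    op = expr[0]
    if op == "triple":                      # ("triple", (s, p, o))
        return triple_vars(expr[1])
    if op == "join":                        # ("join", A1, A2)
        return safe_vars(expr[1]) | safe_vars(expr[2])
    if op == "union":                       # ("union", A1, A2)
        return safe_vars(expr[1]) & safe_vars(expr[2])
    if op in ("minus", "leftjoin"):         # only the left operand counts
        return safe_vars(expr[1])
    if op == "project":                     # ("project", S, A1)
        return safe_vars(expr[2]) & set(expr[1])
    if op == "select":                      # ("select", R, A1)
        return safe_vars(expr[2])
    raise ValueError(f"unknown operator: {op}")

person = ("triple", ("?p", "type", "Person"))
name = ("triple", ("?p", "name", "?n"))
# ?n stems from the optional part only, so it is not a safe variable.
print(safe_vars(("leftjoin", person, name)))
```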



Relational Databases, Constraints and Chase. 

We assume that the reader is familiar with first-order logic
and relational databases. We denote by $dom(I)$ the do-
main of the relational database instance $I$, i.e. the set of
constants and null values that occur in $I$. The constraints
we consider are tuple-generating dependencies (TGDs) and
equality-generating dependencies (EGDs). TGDs have the
form $\forall \bar{x}(\phi(\bar{x}) \to \exists \bar{y}\, \psi(\bar{x}, \bar{y}))$ and EGDs have the form
$\forall \bar{x}(\phi(\bar{x}) \to x_i = x_j)$. A more exact definition of these
types of constraints can be found in Appendix D.1. In the
rest of the paper $\Sigma$ stands for a fixed set of TGDs and EGDs.
If an instance $I$ is not a model of some constraint $\alpha$, then we
write $I \not\models \alpha$.

^1 We do not consider the remaining SPARQL query forms
Ask, Construct, and Describe in this paper.

We now introduce the chase as defined in [8]. A chase
step $I \xrightarrow{\alpha, \bar{a}} J$ takes a relational database instance $I$ such that
$I \not\models \alpha(\bar{a})$ and adds tuples (in case of TGDs) or collapses some
elements (in case of EGDs) such that the resulting relational
database $J$ is a model of $\alpha(\bar{a})$. If $J$ was obtained from $I$ in
this way, we sometimes also write $I\bar{a} \oplus C_\alpha$ instead of $J$. A
chase sequence is a sequence of relational database instances
$I_0, I_1, \ldots$ such that $I_{s+1}$ is obtained from $I_s$ by a chase step.
A chase sequence $I_0, \ldots, I_n$ is terminating if $I_n \models \Sigma$. In
this case, we set $I^\Sigma := I_n$ as the result ($I^\Sigma$ is defined
uniquely only up to homomorphic equivalence, but this will suffice).
Otherwise, $I^\Sigma$ is undefined. $I^\Sigma$ is also undefined in case the
chase fails. More details can be found in Appendix D.1.
The chase does not always terminate and there has been
a variety of work on sufficient termination conditions. In [28]
the following condition, based on the notion of dependency
graph, was introduced. The dependency graph $dep(\Sigma) :=
(V, E)$ of a set of constraints $\Sigma$ is the directed graph defined
as follows. $V$ is the set of positions that occur in $\Sigma$. There
are two kinds of edges in $E$. Add them as follows: for every
TGD $\forall \bar{x}(\phi(\bar{x}) \to \exists \bar{y}\, \psi(\bar{x}, \bar{y})) \in \Sigma$ and for every $x$ in $\bar{x}$ that
occurs in $\psi$ and every occurrence of $x$ in $\phi$ in position $\pi_1$

• for every occurrence of $x$ in $\psi$ in position $\pi_2$, add an edge
$\pi_1 \to \pi_2$ (if it does not already exist).

• for every existentially quantified variable $y$ and for every
occurrence of $y$ in $\psi$ in a position $\pi_2$, add a special edge
$\pi_1 \xrightarrow{*} \pi_2$ (if it does not already exist).

A set $\Sigma$ of TGDs and EGDs is called weakly acyclic iff
$dep(\Sigma)$ has no cycles through a special edge. In [8] weak
acyclicity was lifted to stratification. Given two TGDs or
EGDs $\alpha = \forall \bar{x}_1 \phi$, $\beta = \forall \bar{x}_2 \psi$, we define $\alpha \prec \beta$ (meaning
that firing $\alpha$ may cause $\beta$ to fire) iff there exist relational
database instances $I$, $J$ and $\bar{a} \in dom(I)$, $\bar{b} \in dom(J)$ s.t.

• $I \models \psi(\bar{b})$ (possibly $\bar{b}$ is not in $dom(I)$),

• $I \xrightarrow{\alpha, \bar{a}} J$, and

• $J \not\models \psi(\bar{b})$.

The chase graph $G(\Sigma) = (\Sigma, E)$ of a set of constraints $\Sigma$
contains a directed edge $(\alpha, \beta)$ between two constraints iff
$\alpha \prec \beta$. We call $\Sigma$ stratified iff the set of constraints in every
cycle of $G(\Sigma)$ is weakly acyclic. It is immediate that weak
acyclicity implies stratification; further, it was proven in [8]
that the chase always terminates for stratified constraint sets.
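
The weak acyclicity test described above can be sketched in a few lines of Python. The encoding is an assumption of this sketch: a TGD is a pair (body atoms, head atoms), an atom is a pair (relation name, tuple of variables), constants and EGDs are not modelled, and the names `dependency_graph` and `is_weakly_acyclic` are hypothetical.

```python
from collections import defaultdict

def positions(atoms, var):
    """All positions (relation, index) at which var occurs in the given atoms."""
    return [(rel, i) for rel, args in atoms for i, a in enumerate(args) if a == var]

def dependency_graph(tgds):
    """Return the sets of normal and special edges between positions."""
    normal, special = set(), set()
    for body, head in tgds:
        body_vars = {v for _, args in body for v in args}
        head_vars = {v for _, args in head for v in args}
        existential = head_vars - body_vars
        for x in body_vars & head_vars:       # universal variables occurring in the head
            for p1 in positions(body, x):
                for p2 in positions(head, x):
                    normal.add((p1, p2))
                for y in existential:
                    for p2 in positions(head, y):
                        special.add((p1, p2))
    return normal, special

def is_weakly_acyclic(tgds):
    """True iff no cycle of the dependency graph goes through a special edge."""
    normal, special = dependency_graph(tgds)
    succ = defaultdict(set)
    for u, v in normal | special:
        succ[u].add(v)

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(succ[node])
        return False

    # A special edge u -> v lies on a cycle iff u is reachable from v.
    return not any(reachable(v, u) for u, v in special)

# E(x, y) -> exists z: E(y, z) keeps creating fresh successors.
successor = ([("E", ("x", "y"))], [("E", ("y", "z"))])
# S(x) -> exists y: R(x, y) introduces a null value only once per S-tuple.
witness = ([("S", ("x",))], [("R", ("x", "y"))])
print(is_weakly_acyclic([successor]), is_weakly_acyclic([witness]))
```

The check runs one graph search per special edge and is therefore polynomial in the size of the constraint set, in line with the polynomial-time recognition of weak acyclicity mentioned in the introduction.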

A Conjunctive Query (CQ) is an expression of the form
$ans(\bar{x}) \leftarrow \phi(\bar{x}, \bar{z})$, where $\phi$ is a conjunction of relational atoms and
$\bar{x}$, $\bar{z}$ are tuples of variables and constants. Every variable in
$\bar{x}$ must also occur in $\phi$. The semantics of such a query $q$ on a
database instance $I$ is $q(I) := \{\bar{a} \mid I \models \exists \bar{z}\, \phi(\bar{a}, \bar{z})\}$.

Let $q$, $q'$ be CQs and $\Sigma$ be a set of constraints. We write
$q \sqsubseteq_\Sigma q'$ if for all database instances $I$ such that $I \models \Sigma$ it
holds that $q(I) \subseteq q'(I)$ and say that $q$ and $q'$ are $\Sigma$-equivalent
($q \equiv_\Sigma q'$) if $q \sqsubseteq_\Sigma q'$ and $q' \sqsubseteq_\Sigma q$. In [9] an algorithm was
presented that, given $q$ and $\Sigma$, lists all $\Sigma$-equivalent minimal
(with respect to the number of atoms in the body) rewritings
(up to isomorphism) of $q$. This algorithm, called Chase &
Backchase, uses the chase and therefore does not necessarily
terminate. We denote its output by $cb_\Sigma(q)$ (if it terminates).

General mathematical notation. The natural num-
bers $\mathbb{N}$ do not include 0; $\mathbb{N}_0$ is used as a shortcut for $\mathbb{N} \cup \{0\}$.
For $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$. Further, for
a set $M$, we denote by $2^M$ its powerset.



3. SPARQL Complexity 

We introduce operator shortcuts $\mathcal{A}$ := And, $\mathcal{F}$ := Filter,
$\mathcal{O}$ := Opt, $\mathcal{U}$ := Union, and denote the class of expres-
sions that can be constructed using a set of operators by
concatenating their shortcuts. Further, by $\mathcal{E}$ we denote the
whole class of SPARQL expressions, i.e. $\mathcal{E} := \mathcal{AFOU}$. The
terms class and fragment are used interchangeably.

We first present a complete complexity study for all possi-
ble expression classes, which complements the study in [25].
We assume the reader to be familiar with basics of complex-
ity theory, yet summarize the background in Appendix A.1
to be self-contained. We follow [26] and take the combined
complexity of the Evaluation problem as a yardstick:

Evaluation: given a mapping $\mu$, a document $D$, and an
expression/query $Q$ as input: is $\mu \in [\![Q]\!]_D$?

The theorem below summarizes previous results from [26].

Theorem 1. [26] The Evaluation problem is

• in PTime for class $\mathcal{AF}$; membership in PTime for
classes $\mathcal{A}$ and $\mathcal{F}$ follows immediately,

• NP-complete for class $\mathcal{AFU}$, and

• PSPACE-complete for classes $\mathcal{AOU}$ and $\mathcal{E}$. □

Our first goal is to establish a more precise characterization
of the Union operator. As also noted in [26], its design was
subject to controversial discussions in the SPARQL working
group^2, and we pursue the goal to improve the understanding
of the operator and its relation to others, beyond the known
NP-completeness result for class $\mathcal{AFU}$. The following the-
orem gives the results for all missing Opt-free fragments.

Theorem 2. The Evaluation problem is

• in PTime for classes $\mathcal{U}$ and $\mathcal{FU}$, and

• NP-complete for class $\mathcal{AU}$. □

The hardness part of the NP-completeness proof for frag-
ment $\mathcal{AU}$ is a reduction from Set Cover. The interested
reader will find details and other technical results of this
section in Appendix A. Theorems 1 and 2 clarify that the
source of complexity in Opt-free fragments is the combina-
tion of And and Union. In particular, adding or removing
Filter-expressions in no case affects the complexity.

We now turn towards an investigation of the complexity of
operator Opt and its interaction with other operators. The
PSPACE-completeness results for classes $\mathcal{AOU}$ and $\mathcal{AFOU}$
stated in Theorem 1 give only partial answers to the questions.
One of the main results in this section is the following.

Theorem 3. Evaluation is PSPACE-complete for $\mathcal{O}$. □

This result shows that already operator Opt alone makes
the Evaluation problem really hard. Even more, it up-
grades the claim in [26] that "the main source of complexity
in SPARQL comes from the combination of Union and
Opt operators", by showing that Union (and And) are not
necessary to obtain PSPACE-hardness. The intuition of this
result is that the algebra operator $\leftouterjoin$ (which is the algebraic
counterpart of operator Opt) is defined using operators $\bowtie$,
$\cup$, and $\setminus$; the mix of these algebraic operations compensates
for missing And and Union operators at syntax level. The
corollary below follows from Theorems 1 and 3 and makes
the complexity study of the expression fragments complete.

^2 See the discussion of disjunction in Section 6.1 in
http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050217/

Corollary 1. The Evaluation problem for any expres- 
sion fragment involving Opt is PSPACE-complete. □ 



Due to the high complexity of Opt, an interesting question
is whether we can find natural syntactic conditions that lower
the complexity of fragments involving Opt. In fact, a restric-
tion of the nesting depth of Opt expressions constitutes such
a condition. We define the Opt-rank $r$ of an expression as its
deepest nesting of Opt expressions: for triple pattern $t$, ex-
pressions $Q$, $Q_1$, $Q_2$, and condition $R$, we define $r(Q)$ recursively on
the structure of $Q$ as $r(t) := 0$, $r(Q_1 \text{ Filter } R) := r(Q_1)$,
$r(Q_1 \text{ And } Q_2) = r(Q_1 \text{ Union } Q_2) := \max(r(Q_1), r(Q_2))$,
and $r(Q_1 \text{ Opt } Q_2) := \max(r(Q_1), r(Q_2)) + 1$.
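
The recursive definition of the Opt-rank can be written down as a short Python function. The sketch assumes a hypothetical tuple encoding of expressions, e.g. `("opt", Q1, Q2)` or `("filter", Q1, R)`; this encoding and the name `opt_rank` are illustrative choices made here.

```python
def opt_rank(expr):
    """Deepest nesting of Opt operators in an expression (sketch of r)."""
    op = expr[0]
    if op == "triple":                      # r(t) := 0
        return 0
    if op == "filter":                      # ("filter", Q1, R)
        return opt_rank(expr[1])
    if op in ("and", "union"):              # max of both operands
        return max(opt_rank(expr[1]), opt_rank(expr[2]))
    if op == "opt":                         # one level deeper than the operands
        return max(opt_rank(expr[1]), opt_rank(expr[2])) + 1
    raise ValueError(f"unknown operator: {op}")

t = ("triple", ("?x", "p", "?y"))
nested = ("opt", t, ("opt", t, t))          # Opt inside Opt
flat = ("and", ("opt", t, t), ("opt", t, t))  # two Opt side by side
print(opt_rank(nested), opt_rank(flat))
```

Note that only nesting increases the rank: two Opt expressions combined by And have the same rank as a single one.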

By $\mathcal{E}_{\leq n}$ we denote the class of expressions $Q \in \mathcal{E}$ with
$r(Q) \leq n$. The following theorem shows that, when restrict-
ing the Opt-rank of expressions, the Evaluation problem
falls into a class in the polynomial hierarchy.

Theorem 4. For any $n \in \mathbb{N}_0$, the Evaluation problem
is $\Sigma^P_{n+1}$-complete for the SPARQL fragment $\mathcal{E}_{\leq n}$. □

Observe that Evaluation for class $\mathcal{E}_{\leq 0}$ is complete for
$\Sigma^P_1$ = NP, thus obtaining the result for Opt-free expressions
(cf. Theorem 1). With increasing nesting-depth of Opt ex-
pressions we climb up the polynomial hierarchy (PH). This is
reminiscent of the Validity-problem for quantified boolean
formulae, where the number of quantifier alternations fixes
the complexity class in the PH. In fact, the hardness proof
(see Appendix A.3) makes these similarities explicit.

We finally extend our study to SPARQL queries, i.e. frag-
ments involving top-level projection in the form of a Se-
lect-operator (see Def. 2). We extend the notation for
classes as follows. Let $F$ be an expression fragment. We
denote by $F^+$ the class of queries of the form $\mathrm{Select}_S(Q)$,
where $S \subset V$ is a finite set of variables and $Q \in F$ is an ex-
pression. The next theorem shows that we obtain (top-level)
projection for free in fragments that are at least NP-complete.

Theorem 5. Let $C$ be a complexity class and $F$ a class
of expressions. If Evaluation is $C$-complete for $F$ and
$C \supseteq$ NP then Evaluation is also $C$-complete for $F^+$. □

In combination with Corollary 1 we immediately obtain
PSPACE-completeness for query classes involving opera- 
tor Opt. Similarly, all Opt-free query fragments involv-
ing both And and Union are NP-complete. We conclude
our complexity analysis with the following theorem, which 
shows that top-level projection makes the Evaluation 
problem for And-only expressions considerably harder.

Theorem 6. Evaluation is NP-complete for $\mathcal{A}^+$. □



4. SPARQL Algebra 

We next present a rich set of algebraic equivalences for
SPARQL Algebra. In the interest of a complete survey we
include equivalences that have been stated before in [26].^3
Our main contributions in this section are (a) a systematic
extension of previous rewriting rules, (b) a correction of an
erratum in [26], and (c) the development and discussion of
rewriting rules for SPARQL expressions involving negation.

We focus on two fragments of SPARQL algebra, namely
the full class of algebra expressions $\mathbb{A}$ (i.e., algebra expres-
sions built using operators $\cup$, $\bowtie$, $\setminus$, $\leftouterjoin$, $\pi$, and $\sigma$) and the
union- and projection-free expressions $\mathbb{A}^-$ (built using only
operators $\bowtie$, $\setminus$, $\leftouterjoin$, and $\sigma$). We start with a property that
separates $\mathbb{A}^-$ from $\mathbb{A}$, called incompatibility property.^4

Proposition 1. Let $\Omega$ be the mapping set obtained from
evaluating an $\mathbb{A}^-$-expression on any document $D$. All
pairs of distinct mappings in $\Omega$ are incompatible. □

Figure 1 (I-IV) surveys rewriting rules that hold with re-
spect to common algebraic laws (we write $A \equiv B$ if $A$ is
equivalent to $B$ on any document $D$). Group I contains
results obtained when combining an expression with itself
using the different operators. It is interesting to see that (JIdem)
and (LIdem) hold only for fragment $\mathbb{A}^-$; in fact, it
is the incompatibility property that makes the equivalences
valid. The associativity and commutativity rules were in-
troduced in [26] and we list them for completeness. Most
interesting is distributivity. We observe that $\bowtie$, $\setminus$, $\leftouterjoin$ are
right-distributive over $\cup$, and $\bowtie$ is also left-distributive over
$\cup$. The listing in Figure 1 is complete in the following sense:

Lemma 1. Let $O_1 := \{\bowtie, \setminus, \leftouterjoin\}$ and $O_2 := O_1 \cup \{\cup\}$.

• The two equivalences (JIdem) and (LIdem) in gen-
eral do not hold for fragments larger than $\mathbb{A}^-$.

• Associativity and Commutativity do not hold for
operators $\setminus$ and $\leftouterjoin$.

• Neither $\setminus$ nor $\leftouterjoin$ is left-distributive over $\cup$.

• Let $o_1 \in O_1$, $o_2 \in O_2$, and $o_1 \neq o_2$. Then $o_2$ is
neither left- nor right-distributive over $o_1$. □

Cases (3) and (4) rule out distributivity for all operator
combinations different from those listed in Figure 1. This
result implies that Proposition 1(3) in [26] is wrong:

Example 1. We show that the SPARQL equivalence

$A_1 \text{ Opt } (A_2 \text{ Union } A_3) \equiv (A_1 \text{ Opt } A_2) \text{ Union } (A_1 \text{ Opt } A_3)$

stated in Proposition 1(3) in [26] does not hold in
the general case. We choose database $D = \{(0, c, 1)\}$ and
set $A_1 = (0, c, ?a)$, $A_2 = (?a, c, 1)$, and $A_3 = (0, c, ?b)$. Then
$[\![A_1 \text{ Opt } (A_2 \text{ Union } A_3)]\!]_D = \{\{?a \mapsto 1, ?b \mapsto 1\}\}$,

but $[\![(A_1 \text{ Opt } A_2) \text{ Union } (A_1 \text{ Opt } A_3)]\!]_D$ evaluates to
$\{\{?a \mapsto 1\}, \{?a \mapsto 1, ?b \mapsto 1\}\}$. The results differ. □
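
The counterexample can be replayed mechanically. The following Python sketch implements a minimal evaluator for triple patterns, And, Union, and Opt along the lines of Definition 4; the tuple encoding of expressions and all function names are assumptions of this sketch, and Filter and Select are omitted because the example does not need them.

```python
from itertools import product

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(pattern, triple):
    """Return the mapping that sends pattern to triple, or None if there is none."""
    binding = {}
    for p, v in zip(pattern, triple):
        if is_var(p):
            if binding.setdefault(p, v) != v:   # same variable, different values
                return None
        elif p != v:                            # constant mismatch
            return None
    return frozenset(binding.items())

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())

def evaluate(expr, doc):
    """Evaluate a tuple-encoded expression over a set of triples."""
    op = expr[0]
    if op == "triple":
        return {m for m in (match(expr[1], t) for t in doc) if m is not None}
    left, right = evaluate(expr[1], doc), evaluate(expr[2], doc)
    joined = {a | b for a, b in product(left, right) if compatible(a, b)}
    if op == "and":
        return joined
    if op == "union":
        return left | right
    if op == "opt":                             # join plus unmatched left mappings
        return joined | {a for a in left if not any(compatible(a, b) for b in right)}
    raise ValueError(f"unknown operator: {op}")

D = {(0, "c", 1)}
A1 = ("triple", (0, "c", "?a"))
A2 = ("triple", ("?a", "c", 1))
A3 = ("triple", (0, "c", "?b"))

lhs = evaluate(("opt", A1, ("union", A2, A3)), D)
rhs = evaluate(("union", ("opt", A1, A2), ("opt", A1, A3)), D)
print(lhs == rhs)
```

On this document the right-hand side contains the additional mapping that binds only ?a, which is exactly the difference stated in Example 1.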

^3 Most equivalences in [26] were established at the syntactic
level. In summary, rule (MJ) in Proposition 5 and about
half of the equivalences in Figure 1 are borrowed from [26].
We indicate these rules in the proofs in Appendix B.
^4 Lemma 2 in [26] also builds on this observation.
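I. Idempotence and Inverse

$A \cup A \equiv A$    (UIdem)
$A^- \bowtie A^- \equiv A^-$    (JIdem)
$A^- \leftouterjoin A^- \equiv A^-$    (LIdem)
$A \setminus A \equiv \emptyset$    (Inv)

II. Associativity

$(A_1 \cup A_2) \cup A_3 \equiv A_1 \cup (A_2 \cup A_3)$    (UAss)
$(A_1 \bowtie A_2) \bowtie A_3 \equiv A_1 \bowtie (A_2 \bowtie A_3)$    (JAss)

III. Commutativity

$A_1 \cup A_2 \equiv A_2 \cup A_1$    (UComm)
$A_1 \bowtie A_2 \equiv A_2 \bowtie A_1$    (JComm)

IV. Distributivity

$(A_1 \cup A_2) \bowtie A_3 \equiv (A_1 \bowtie A_3) \cup (A_2 \bowtie A_3)$    (JUDistR)
$A_1 \bowtie (A_2 \cup A_3) \equiv (A_1 \bowtie A_2) \cup (A_1 \bowtie A_3)$    (JUDistL)
$(A_1 \cup A_2) \setminus A_3 \equiv (A_1 \setminus A_3) \cup (A_2 \setminus A_3)$    (MUDistR)
$(A_1 \cup A_2) \leftouterjoin A_3 \equiv (A_1 \leftouterjoin A_3) \cup (A_2 \leftouterjoin A_3)$    (LUDistR)

V. Filter Decomposition and Elimination

$\sigma_R(A_1 \cup A_2) \equiv \sigma_R(A_1) \cup \sigma_R(A_2)$    (SUPush)
$\sigma_{R_1 \wedge R_2}(A) \equiv \sigma_{R_1}(\sigma_{R_2}(A))$    (SDecompI)
$\sigma_{R_1 \vee R_2}(A) \equiv \sigma_{R_1}(A) \cup \sigma_{R_2}(A)$    (SDecompII)
$\sigma_{R_1}(\sigma_{R_2}(A)) \equiv \sigma_{R_2}(\sigma_{R_1}(A))$    (SReord)
$\sigma_{bnd(?x)}(A_1) \equiv A_1$, if $?x \in safeVars(A_1)$    (BndI)
$\sigma_{bnd(?x)}(A_1) \equiv \emptyset$, if $?x \notin vars(A_1)$    (BndII)
$\sigma_{\neg bnd(?x)}(A_1) \equiv \emptyset$, if $?x \in safeVars(A_1)$    (BndIII)
$\sigma_{\neg bnd(?x)}(A_1) \equiv A_1$, if $?x \notin vars(A_1)$    (BndIV)
If $?x \in safeVars(A_2) \setminus vars(A_1)$, then
$\sigma_{bnd(?x)}(A_1 \leftouterjoin A_2) \equiv A_1 \bowtie A_2$    (BndV)

VI. Filter Pushing

The following rules hold if $vars(R) \subseteq safeVars(A_1)$.

$\sigma_R(A_1 \bowtie A_2) \equiv \sigma_R(A_1) \bowtie A_2$    (SJPush)
$\sigma_R(A_1 \setminus A_2) \equiv \sigma_R(A_1) \setminus A_2$    (SMPush)
$\sigma_R(A_1 \leftouterjoin A_2) \equiv \sigma_R(A_1) \leftouterjoin A_2$    (SLPush)

Figure 1: SA equivalences for $\mathbb{A}$-expressions $A$, $A_1$, $A_2$, $A_3$; $\mathbb{A}^-$-expression $A^-$; filter condition $R$; variable $?x$.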



Remark 1. This erratum calls the existence of the union
normal form stated in Proposition 1 in [26] into ques-
tion, as it builds upon the invalid equivalence. We ac-
tually do not see how to fix or compensate for this rule,
so it remains an open question if such a union normal
form exists or not. The non-existence would put differ-
ent results into perspective, since - based on the claim
that Union can always be pulled to the top - the au-
thors restrict the subsequent discussion to Union-free
expressions. For instance, results on well-defined pat-
terns, normalization, and equivalence between compo-
sitional and operational semantics are applicable only
to queries that can be brought into union normal form. 
Arguably, this class may comprise most of the SPARQL 
queries that arise in practice (queries without union or 
with union only at the top-level also constitute very fre- 
quent patterns in other query languages, such as SQL). 
Still, a careful reinvestigation would be necessary to ex- 
tend the results to queries beyond that class. □ 

Figure 1 (V-VI) presents rules for decomposing, eliminat-
ing, and rearranging (parts of) filter conditions. In combina-
tion with rewriting rules I-IV they provide a powerful frame-
work for manipulating filter expressions in the style of RA fil-
ter rewriting and pushing. Most interesting is the use of safe-
Vars as a sufficient precondition for (SJPush), (SMPush),
and (SLPush).^5 The need for this precondition arises from
the fact that joins over mappings are accepting for unbound
variables. In RA, where joins over null values are rejecting,
the situation is less complicated. For instance, given two RA
relations $A_1$, $A_2$ and a (relational) filter $R$, (SJPush) is appli-
cable whenever the schema of $A_1$ contains all attributes in $R$.
We conclude this discussion with the remark that, for smaller
fragments of SPARQL conditions, weaker preconditions for
the rules in group VI exist. For instance, if $R = e_1 \wedge \cdots \wedge e_n$
is a conjunction of atomic equalities $e_1, \ldots, e_n$, then the
equivalences in group VI follow from the (weaker) condition
$vars(R) \subseteq vars(A) \wedge vars(B) \cap vars(R) \subseteq safeVars(A)$.



We omit a detailed discussion of operator $\pi$, also because
- when translating SPARQL queries into algebra expressions -
this operator appears only at the top-level. Still,
we emphasize that rewriting rules exist for this operator as well,
e.g. allowing us to project away unneeded variables at an early
stage. Instead, in the remainder of this section we will present
a thorough discussion of operator $\setminus$. The latter, in contrast
to the other algebraic operations, has no direct counterpart at
the syntactic level. This complicates the encoding of queries
involving negation and, as we will see, poses specific challenges
to the optimization scheme. We start with the remark
that, as shown in [1], operator $\setminus$ can always be encoded at
the syntactic level through a combination of operators Opt,
Filter, and bnd. We illustrate the idea by example.

Example 2. The following SPARQL expression $Q_1$ and
the corresponding algebra expression $A_1$ select all persons
for which no name is specified in the data set.

$Q_1 = \text{Filter}_{\neg bnd(?n)}((?p, type, Person)\ \text{Opt}\ ((?p, type, Person)\ \text{And}\ (?p, name, ?n)))$

$A_1 = \sigma_{\neg bnd(?n)}(\llbracket (?p, type, Person) \rrbracket_D ⟕ (\llbracket (?p, type, Person) \rrbracket_D \Join \llbracket (?p, name, ?n) \rrbracket_D))$ □



From an optimization point of view it would be desirable to 
have a clean translation of this constellation using operator \, 
but the semantics maps $Q_1$ into $A_1$, which contains operators
$\sigma$, ⟕, $\Join$, and predicate bnd, rather than $\setminus$. In fact, a better
translation (based on \) exists for a class of practical queries 
and we will provide rewriting rules for such a transformation. 

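Proposition 2. Let $A_1, A_2, A_3$ be $\mathbb{A}$-expressions and
$A_1^-, A_2^-$ be $\mathbb{A}^-$-expressions. The following equivalences hold.

$(A_1 \setminus A_2) \setminus A_3 \equiv (A_1 \setminus A_3) \setminus A_2$ (MReord)
$(A_1 \setminus A_2) \setminus A_3 \equiv A_1 \setminus (A_2 \cup A_3)$ (MMUCorr)
$A_1 \setminus A_2 \equiv A_1 \setminus (A_1 \Join A_2)$ (MJ)
$A_1^- ⟕ A_2^- \equiv A_1^- ⟕ (A_1^- \Join A_2^-)$ (LJ)

□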



^A variant of rule (SJPush), restricted to AND-only queries,
has been stated (at syntax level) in Lemma 1(2) in [26].



Rules (MReord) and (MMUCorr) are general-purpose
rewriting rules, listed for completeness. Most important in
our context is rule (LJ). It allows us to eliminate redundant
subexpressions in the right side of ⟕-expressions (for
$\mathbb{A}^-$-expressions), e.g. the application of (LJ) simplifies $A_1$ to
$A_1' = \sigma_{\neg bnd(?n)}(\llbracket (?p, type, Person) \rrbracket_D ⟕ \llbracket (?p, name, ?n) \rrbracket_D)$.
The following lemma allows for further simplification.

Lemma 2. Let $A_1^-, A_2^-$ be $\mathbb{A}^-$-expressions, $R$ a filter
condition, and $?x \in safeVars(A_2^-) \setminus vars(A_1^-)$ a variable.
Then $\sigma_{\neg bnd(?x)}(A_1^- ⟕ A_2^-) \equiv A_1^- \setminus A_2^-$ holds. □

The application of the lemma to query $A_1'$ yields the expression
$A_1'' = \llbracket (?p, type, Person) \rrbracket_D \setminus \llbracket (?p, name, ?n) \rrbracket_D$.
Combined with rule (LJ), we have established a powerful mechanism
that often allows us to make simulated negation explicit.
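The rewriting chain from Example 2 (first rule (LJ), then Lemma 2) can be checked on a toy data set. The following Python sketch is only an illustration under simplifying assumptions: mappings are modelled as frozensets of (variable, value) pairs, terms starting with `?` are variables, and the data set and all function names are hypothetical, not taken from the paper.

```python
from itertools import product

def compatible(m1, m2):
    """Two mappings (dicts) are compatible if they agree on shared variables."""
    return all(m2[v] == val for v, val in m1.items() if v in m2)

def join(o1, o2):
    """Join: merge every compatible pair of mappings."""
    return {frozenset({**dict(a), **dict(b)}.items())
            for a, b in product(o1, o2) if compatible(dict(a), dict(b))}

def minus(o1, o2):
    """Minus: keep mappings of o1 that are compatible with no mapping of o2."""
    return {a for a in o1
            if not any(compatible(dict(a), dict(b)) for b in o2)}

def left_outer_join(o1, o2):
    """Left outer join, defined as (o1 join o2) union (o1 minus o2)."""
    return join(o1, o2) | minus(o1, o2)

def filter_not_bound(o, var):
    """Selection with condition not bnd(var)."""
    return {m for m in o if var not in dict(m)}

def eval_triple_pattern(pattern, triples):
    """Evaluate a single triple pattern over a set of triples."""
    result = set()
    for triple in triples:
        m, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must be bound to the same value.
                if m.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            result.add(frozenset(m.items()))
    return result

# Hypothetical data set: bob is a person without a name.
D = {("alice", "type", "Person"), ("bob", "type", "Person"),
     ("alice", "name", "Alice")}
persons = eval_triple_pattern(("?p", "type", "Person"), D)
names = eval_triple_pattern(("?p", "name", "?n"), D)

a1 = filter_not_bound(left_outer_join(persons, join(persons, names)), "?n")
a1_prime = filter_not_bound(left_outer_join(persons, names), "?n")  # after (LJ)
a1_double_prime = minus(persons, names)                             # after Lemma 2
print(a1 == a1_prime == a1_double_prime)  # True
```

On this data set all three expressions return the single mapping that binds `?p` to `bob`. This only illustrates the equivalences on one instance; it does not prove them.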

5. Semantic SPARQL Query Optimization 

This section complements the discussion of algebraic optimization
with constraint-based, semantic query optimization
(SQO). The key idea of SQO is to find semantically equivalent
queries over a database that satisfies a set of integrity
constraints. These constraints might have been specified by
the user, extracted from the underlying database, or hold
implicitly when SPARQL is evaluated on RDFS data coupled
with an RDFS inference system^. More precisely, given a
query $Q$ and a set of constraints $\Sigma$ over an RDF database
$D$ s.t. $D \models \Sigma$, we want to enumerate (all) queries $Q'$ that
compute the same result on $D$. We write $Q \equiv_\Sigma Q'$ if $Q$
is equivalent to $Q'$ on each database $D$ s.t. $D \models \Sigma$. Following
previous approaches [9], we focus on TGDs and
EGDs, which cover a broad range of practical constraints
over RDF, such as functional and inclusion dependencies.
When talking about constraints in the following we always
mean TGDs or EGDs. We refer the interested reader to [19]
for motivating examples and a study of constraints for RDF.
We represent each constraint $\alpha \in \Sigma$ by a first-order logic
formula over a ternary relation $T_D(s,p,o)$ that stores all
triples contained in RDF database $D$ and use $T$ as the
corresponding relation symbol. For instance, the constraint
$\forall x_1,x_2(T(x_1,p_1,x_2) \rightarrow \exists y_1 T(x_1,p_2,y_1))$ states that each
RDF resource with property $p_1$ also has property $p_2$. Like in
the case of conjunctive queries we call an $\mathcal{A}^+$ query minimal
if there is no equivalent $\mathcal{A}^+$ query with fewer triple patterns.
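To make the example constraint concrete, the following Python sketch checks whether a set of triples satisfies "each resource with property p1 also has property p2". It is an illustration only: triples are plain tuples, and the function names and sample data are hypothetical.

```python
def violations(triples, p1, p2):
    """Subjects that have property p1 but no value for property p2."""
    has_p1 = {s for (s, p, o) in triples if p == p1}
    has_p2 = {s for (s, p, o) in triples if p == p2}
    return has_p1 - has_p2

def satisfies(triples, p1, p2):
    """True iff the triples satisfy: every subject with p1 also has p2."""
    return not violations(triples, p1, p2)

# Hypothetical data: r2 has property p1 but lacks p2.
D = {("r1", "p1", "v"), ("r1", "p2", "w"), ("r2", "p1", "v")}
print(violations(D, "p1", "p2"))  # {'r2'}
```

On a database that satisfies the constraint, a triple pattern with predicate p2 and an otherwise unused object variable is redundant next to a pattern with predicate p1 on the same subject. Removing such patterns is the kind of minimization that SQO aims for.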

Our approach relies on the Chase & Backchase (C&B)
algorithm for semantic optimization of CQs proposed in [9].
Given a CQ $q$ and a set $\Sigma$ of constraints as input, the algorithm
outputs all semantically equivalent and minimal $q' \equiv_\Sigma q$
whenever the underlying chase algorithm terminates. We
defer the discussion of chase termination to the subsequent
section and use the C&B algorithm as a black box with
the above properties. Our basic idea is as follows. First, we
translate AND-only blocks (or queries), so-called basic graph
patterns (BGPs), into CQs and then apply C&B to optimize
them. We introduce two alternative translation schemes below.

Definition 6. Let $S \subset V$ be a finite set of variables and
$Q \in \mathcal{A}^+$ be a SPARQL query defined as

$Q = \text{Select}_S((s_1,p_1,o_1)\ \text{And}\ \dots\ \text{And}\ (s_n,p_n,o_n))$.

We define the translation $C_1(Q) := q$, where

$q : ans(\bar{s}) \leftarrow T(s_1,p_1,o_1), \dots, T(s_n,p_n,o_n)$,

and $\bar{s}$ is a vector of variables containing exactly the
variables in $S$. We define $C_1^{-1}(q)$ as follows. It takes a
CQ in the form of $q$ as input and returns $Q$ if it is a valid
SPARQL query, i.e. if $s_i \in BIV$, $p_i \in IV$, $o_i \in BILV$
for all $i \in [n]$; otherwise, $C_1^{-1}(q)$ is undefined. □

^Note that the SPARQL semantics disregards RDFS inference,
but assumes that it is realized in a separate layer.

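Definition 7. Let $\Sigma$ be a set of RDF constraints, $D$ an
RDF database, and $\alpha = \forall \bar{x}(\varphi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x},\bar{y})) \in \Sigma$.
We use $h(T(a_1,a_2,a_3)) := a_2(a_1,a_3)$ if $a_2$ is not a variable,
otherwise we set it to the empty string. For a conjunction
$\bigwedge_{i=1}^{n} T(\bar{a}_i)$ of atoms, we set
$h(\bigwedge_{i=1}^{n} T(\bar{a}_i)) := \bigwedge_{i=1}^{n} h(T(\bar{a}_i))$.
Then, we define the constraint $\alpha'$ as
$\forall \bar{x}(h(\varphi(\bar{x})) \rightarrow \exists \bar{y}\, h(\psi(\bar{x},\bar{y})))$.
We set $\Sigma' := \{\alpha' \mid \alpha \in \Sigma\}$
if all $\alpha'$ are constraints, otherwise $\Sigma' := \emptyset$.

Let $S \subset V$ be a set of variables, $Q \in \mathcal{A}^+$ defined as

$Q = \text{Select}_S((s_1,p_1,o_1)\ \text{And}\ \dots\ \text{And}\ (s_n,p_n,o_n))$,

and assume that $p_i$ is never a variable. We define the
translation $C_2(Q) := ans(\bar{s}) \leftarrow p_1(s_1,o_1), \dots, p_n(s_n,o_n)$,
where vector $\bar{s}$ contains exactly the variables in $S$. For
a CQ $q : ans(\bar{s}) \leftarrow R_1(x_{11},x_{12}), \dots, R_n(x_{n1},x_{n2})$, we
denote by $C_2^{-1}(q)$ the expression

$\text{Select}_S((x_{11},R_1,x_{12})\ \text{And}\ \dots\ \text{And}\ (x_{n1},R_n,x_{n2}))$

if it is a SPARQL query, else $C_2^{-1}(q)$ is undefined. □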

$C_1(Q)$ and $C_1^{-1}(q)$ constitute straightforward translations
from SPARQL AND-only queries to CQs and back. The definition
of $C_2(Q)$ and $C_2^{-1}(q)$ was inspired by the work in [13]
and is motivated by the observation that in many real-world
SPARQL queries variables do not occur in predicate position;
it is applicable only in this context. Given that the second
translation scheme is not always applicable, the reader may
wonder why we introduced it. The reason is that the two
translation schemes differ w.r.t. the termination conditions
for the subsequent chase that they exhibit. We will come
back to this issue when discussing termination conditions for
the chase in the next section (see Proposition 4).
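The two translation schemes of Definitions 6 and 7 can be sketched in a few lines of Python. The sketch below rests on conventions that are assumptions for illustration only: terms starting with `?` are variables, terms starting with `"` are literals, terms starting with `_:` are blank nodes, and the types `Atom` and `CQ` as well as all function names are hypothetical.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    relation: str
    args: tuple

class CQ(NamedTuple):
    head: tuple  # answer variables
    body: tuple  # tuple of Atom

def is_var(term):
    return term.startswith("?")

def is_literal(term):
    return term.startswith('"')

def is_blank(term):
    return term.startswith("_:")

def c1(select_vars, patterns):
    """C1: every triple pattern becomes an atom over the ternary relation T."""
    body = tuple(Atom("T", (s, p, o)) for (s, p, o) in patterns)
    return CQ(tuple(sorted(select_vars)), body)

def c1_inverse(cq):
    """C1^-1: back to triple patterns, or None if the result is not valid SPARQL."""
    patterns = []
    for atom in cq.body:
        s, p, o = atom.args
        # Literals may not occur as subject or predicate, blank nodes not as predicate.
        if is_literal(s) or is_literal(p) or is_blank(p):
            return None
        patterns.append((s, p, o))
    return set(cq.head), patterns

def c2(select_vars, patterns):
    """C2: predicates become binary relation symbols; None if a predicate is a variable."""
    if any(is_var(p) for (_, p, _) in patterns):
        return None
    body = tuple(Atom(p, (s, o)) for (s, p, o) in patterns)
    return CQ(tuple(sorted(select_vars)), body)

def c2_inverse(cq):
    """C2^-1: binary atoms back to triple patterns (validity check omitted here)."""
    return set(cq.head), [(a.args[0], a.relation, a.args[1]) for a in cq.body]

patterns = [("?x", "knows", "?y"), ("?y", "name", "?n")]
q1 = c1({"?x"}, patterns)  # ans(?x) <- T(?x,knows,?y), T(?y,name,?n)
q2 = c2({"?x"}, patterns)  # ans(?x) <- knows(?x,?y), name(?y,?n)
```

The second scheme returns `None` as soon as a variable occurs in predicate position, which mirrors the restricted applicability discussed above.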

The translation schemes, although defined for $\mathcal{A}^+$ queries,
directly carry over to $\mathcal{A}$-expressions, i.e. each expression
$Q \in \mathcal{A}$ can be rewritten into the equivalent $\mathcal{A}^+$-expression
$\text{Select}_{vars(Q)}(Q)$. Coupled with the C&B algorithm, they
provide a sound approach to semantic query optimization for
AND-only queries whenever the underlying chase algorithm
terminates, as stated by the following lemma.

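Lemma 3. Let $Q$ be an $\mathcal{A}^+$-expression, $D$ a database, and
$\Sigma$ a set of EGDs and TGDs.

• If $cb_\Sigma(C_1(Q))$ terminates then $\forall Q' \in \mathcal{A}^+$:

$Q' \in C_1^{-1}(cb_\Sigma(C_1(Q))) \Rightarrow Q' \equiv_\Sigma Q$ and $Q'$ minimal.

• If $C_2(Q)$ is defined, $|\Sigma'| = |\Sigma|$ and $cb_{\Sigma'}(C_2(Q))$
terminates then so does $cb_\Sigma(C_1(Q))$.

• If $C_2(Q)$ is defined, $|\Sigma'| = |\Sigma|$ and $cb_{\Sigma'}(C_2(Q))$
terminates then $\forall Q' \in \mathcal{A}^+$:

$Q' \in C_1^{-1}(cb_\Sigma(C_1(Q))) \Rightarrow Q' \equiv_\Sigma Q$. □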
The converse direction in bullets one and three does not
hold in general, i.e. the scheme is not complete. Before we 
address this issue, we illustrate the problem by example. 

Example 3. Let the two expressions $Q_1 := (?x, b, 'l')$,
$Q_2 := (?x, b, 'l')\ \text{And}\ (?x, a, c)$, and let the constraint
$\forall x_1,x_2,x_3(T(x_1,x_2,x_3) \rightarrow T(x_3,x_2,x_1))$ be given.
No RDF database that satisfies this constraint contains a
literal in object position because, according to the
constraint, such a literal would also occur in the subject
position, which is not allowed. Therefore, the answer
to both expressions $Q_1$ and $Q_2$ is always the empty set,
which implies $Q_1 \equiv_\Sigma Q_2$. But it is easy to verify that
$C_1(Q_1) \equiv_\Sigma C_1(Q_2)$ does not hold. The reason for this
discrepancy is that the universal plan [9] of the queries
is not a valid SPARQL query. □

We formalize this observation in the next lemma, i.e. provide
a precondition that guarantees completeness^. For a
CQ $q$, we denote by $U(q)$ its universal plan [9], namely the
conjunctive query $q'$ obtained from $q$ by chasing its body.

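Lemma 4. Let $D$ be a database and $Q$ an $\mathcal{A}^+$-expression
such that $C_1^{-1}(U(C_1(Q))) \in \mathcal{A}^+$.

• If $cb_\Sigma(C_1(Q))$ terminates then $\forall Q' \in \mathcal{A}^+$ such that
$C_1^{-1}(U(C_1(Q'))) \in \mathcal{A}^+$:

$Q' \in C_1^{-1}(cb_\Sigma(C_1(Q))) \Leftrightarrow Q' \equiv_\Sigma Q$ and $Q'$ minimal.

• If $C_2(Q)$ is defined, $|\Sigma'| = |\Sigma|$ and $cb_{\Sigma'}(C_2(Q))$
terminates then $\forall Q' \in \mathcal{A}^+$ s.t. $C_1^{-1}(U(C_1(Q'))) \in \mathcal{A}^+$:
$Q' \in C_1^{-1}(cb_\Sigma(C_1(Q))) \Leftrightarrow Q' \equiv_\Sigma Q$ and $Q'$ minimal. □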

By now we have established a mechanism that allows us to 
enumerate equivalent queries of SPARQL AND-only queries, 
or BGPs inside queries. Next, we provide extensions that go 
beyond AND-only queries. The first rule in the following 
lemma shows that sometimes Opt can be replaced by And; 
informally speaking, it applies when the expression in the Opt
clause is implied by the constraints. The second rule can be 
used to eliminate redundant BGPs in OPT-subexpressions. 

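Lemma 5. Let $Q_1, Q_2, Q_3 \in \mathcal{A}$ and $S \subset V$ a finite set
of variables.

• If $Q_1 \equiv_\Sigma \text{Select}_{vars(Q_1)}(Q_1\ \text{And}\ Q_2)$ then
$Q_1\ \text{Opt}\ Q_2 \equiv_\Sigma Q_1\ \text{And}\ Q_2$.

• If $Q_1 \equiv_\Sigma Q_1\ \text{And}\ Q_2$ then

$(Q_1\ \text{Opt}\ (Q_2\ \text{And}\ Q_3)) \equiv_\Sigma Q_1\ \text{Opt}\ Q_3$. □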

Note that the preconditions are always expressed in terms 
of AND-only queries and projection, thus can be checked 
using our translation schemes and the C&B algorithm. We 
conclude our discussion of SQO with a lemma that gives 
rules for the elimination of redundant filter expressions. 

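Lemma 6. Let $Q_1, Q_2 \in \mathcal{A}$, $S \subset V \setminus \{?y\}$ a set of variables,
$?x, ?y \in vars(Q_2)$, $\Sigma$ a set of constraints, and $D$
a document s.t. $D \models \Sigma$. Further let $Q_2\tfrac{?x}{?y}$ be obtained
from $Q_2$ by replacing each occurrence of $?y$ by $?x$.

• If $Q_1 \equiv_\Sigma \text{Select}_{vars(Q_1)}(Q_1\ \text{And}\ Q_2)$ then
$\llbracket \text{Filter}_{\neg bnd(?x)}(Q_1\ \text{Opt}\ Q_2) \rrbracket_D = \emptyset$.

• If $\text{Select}_S(Q_2) \equiv_\Sigma \text{Select}_S(Q_2\tfrac{?x}{?y})$ then

$\text{Select}_S(\text{Filter}_{?x=?y}(Q_2)) \equiv_\Sigma \text{Select}_S(Q_2\tfrac{?x}{?y})$.

• If $\text{Select}_S(Q_2) \equiv_\Sigma \text{Select}_S(Q_2\tfrac{?x}{?y})$ then
$\llbracket \text{Filter}_{\neg ?x=?y}(Q_2) \rrbracket_D = \emptyset$. □

^We might expect that situations such as the one sketched in
Example 3 occur rarely in practice, so the condition in Lemma 4
may guarantee completeness in most practical scenarios.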

We conclude this section with some final remarks. First,
we note that semantic optimization strategies are basically
orthogonal to algebraic optimizations, hence both approaches
can be coupled with each other. For instance, we might
get better optimization results when combining the rules for
filter decomposition and pushing in Figure 1(V-VI) with the
semantic rewriting rules for filter expressions in the lemma
above. Second, as discussed in [9], the C&B algorithm can
be enhanced by a cost function, which makes it easy to factor
in cost-based query optimization approaches for SPARQL,
e.g. in the style of [23]. This flexibility strengthens the
prospects and practicability of our semantic optimization
scheme. The study of rewriting heuristics and the integration
of a cost function, though, is beyond the scope of this paper.

6. Chase Termination 

The applicability of the C&B algorithm, and hence of our 
SQO scheme presented in the previous section, depends on 
the termination of the underlying chase algorithm. Given 
an arbitrary set of constraints it is in general undecidable if 
the chase terminates for every database instance [8]; still, in
the past several sufficient termination conditions have been 
postulated lUS [TOl H] |28l H . The strongest sufficient con- 
ditions known so far are weak acyclicity f28 |, which was 
strictly generalized to stratification in [8], raising the recognition
problem from P to coNP. Our SQO approach on top
of the C&B algorithm motivated a reinvestigation of these 
termination conditions, and as a key result we present two 
novel chase termination conditions for the classical frame- 
work of relational databases, which empower virtually all 
applications that rely on the chase. Whenever we men- 
tion a database or a database instance in this section, we 
mean a relational database. We will start our discussion 
with a small example run of the chase algorithm. The ba- 
sic idea is simple: given a database and a set of constraints 
as input, it fixes constraint violations in the database instance.
Consider for example database $\{R(a,b)\}$ and constraint
$\forall x_1,x_2(R(x_1,x_2) \rightarrow \exists y R(x_2,y))$. The chase first
adds $R(b,y_1)$ to the instance, where $y_1$ is a fresh null value.
The constraint is still violated, because $y_1$ does not occur
in the first position of an $R$-tuple. So, the chase will add
$R(y_1,y_2)$, $R(y_2,y_3)$, $R(y_3,y_4)$, ... in subsequent steps,
where $y_2, y_3, y_4, \dots$ are fresh null values. Obviously, the
chase algorithm will never terminate in this toy example.
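The toy run can be reproduced with a small Python sketch. It hard-codes the single constraint from the example and stops after a fixed number of steps; the step bound, the naming of null values as `y1, y2, ...`, and the function name are assumptions made for illustration, not part of the chase as defined in the literature.

```python
from itertools import count

def chase_toy(instance, max_steps):
    """Chase a binary relation R with: forall x1,x2 (R(x1,x2) -> exists y R(x2,y)).

    Returns (instance, steps performed, terminated?). The bound max_steps only
    exists so that the sketch halts on non-terminating inputs.
    """
    instance = set(instance)
    fresh = count(1)
    steps = 0
    while True:
        firsts = {a for (a, _) in instance}
        # The constraint is violated for every second-position value
        # that never occurs in the first position of an R-tuple.
        violating = sorted(b for (_, b) in instance if b not in firsts)
        if not violating:
            return instance, steps, True
        if steps == max_steps:
            return instance, steps, False
        # Repair one violation by adding a tuple with a fresh null value.
        instance.add((violating[0], f"y{next(fresh)}"))
        steps += 1

result, steps, terminated = chase_toy({("a", "b")}, max_steps=4)
print(sorted(result), terminated)
```

After four steps the instance contains R(a,b), R(b,y1), R(y1,y2), R(y2,y3), and R(y3,y4), and the constraint is still violated for y4. An input such as {R(a,a)} satisfies the constraint from the start, so the sketch stops after zero steps.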

Figure 2 summarizes the results of this section and puts
them into context. First, we will introduce the novel class 
of safe constraints, which guarantees the termination of the 
chase. It strictly subsumes weak acyclicity, but is different 
from stratification. Building upon the definition of safety, we 
Figure 2: Chase termination conditions.

then present safely restricted constraints as a consistent
advancement of our ideas. The latter class strictly subsumes all
remaining termination conditions known so far. Finally, we
will show that, based on our framework, we can easily define
another class called safely stratified constraints, which is
strictly contained in the class of safely restricted constraints,
but also subsumes weak acyclicity and safety.

Safe Constraints. The basic idea of the first termination
condition is to keep track of positions a newly introduced
labeled null may be copied to. Consider for instance constraint
$R(x_1,x_2,x_3), S(x_2) \rightarrow \exists y R(x_2,y,x_1)$, which is not
weakly acyclic. Its dependency graph is depicted in Figure 3
(left). As illustrated in the toy example in the beginning
of this section, a cascading of labeled nulls (i.e. new labeled
null values that are created over and over again) may cause a
non-terminating chase sequence. However, we can observe
that for the constraint above such a cascading of fresh labeled
nulls cannot occur, i.e. no fresh labeled null can repeatedly
create new labeled nulls in position $R^2$ while copying itself
to position $R^1$. The reason is that the constraint cannot be
violated with a fresh labeled null in $R^2$, i.e. if $R(a_1,a_2,a_3)$
and $S(a_2)$ hold, but $\exists y R(a_2,y,a_1)$ does not, then $a_2$ is never
a newly created labeled null. This is due to the fact that $a_2$
must also occur in relation $S$, which is not modified when
chasing only with this single constraint. Consequently, the
chase sequence always terminates. We will later see that this
is not a mere coincidence: the constraint is safe.

To formally define safety, we first introduce the notion of 
affected positions. Intuitively, a position is affected if, during 
the application of the chase, a newly introduced labeled null 
can be copied or created in it. Thus, the set of affected 
positions is an overestimation of the positions in which a 
null value that was introduced during the chase may occur.

Definition 8. [4] Let $\Sigma$ be a set of TGDs. The set of
affected positions $aff(\Sigma)$ of $\Sigma$ is defined inductively as
follows. Let $\pi$ be a position in the head of an $\alpha \in \Sigma$.

• If an existentially quantified variable appears in $\pi$,
then $\pi \in aff(\Sigma)$.

• If the same universally quantified variable $x$ appears
both in position $\pi$, and only in affected positions
in the body of $\alpha$, then $\pi \in aff(\Sigma)$. □

Although we borrow this definition from [4], our focus is
different. We extend known classes of constraints for which
the chase terminates. The focus in [4] is on query answering
in cases where the chase may not terminate. Our work neither
subsumes [4] nor the other way around. Like in the case of
weak acyclicity, we define the safety condition with the help
of the absence of cycles containing special edges in some
graph. We call this graph the propagation graph.

Figure 3: Left: Dependency graph. Right: Corresponding
propagation graph (it has no edges).

Definition 9. Given a set of TGDs $\Sigma$, the propagation
graph $prop(\Sigma) := (aff(\Sigma), E)$ is the directed graph
defined as follows. There are two kinds of edges in $E$. Add
them as follows: for every TGD $\forall \bar{x}(\varphi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x},\bar{y})) \in \Sigma$
and for every $x$ in $\bar{x}$ that occurs in $\psi$ and every
occurrence of $x$ in $\varphi$ in position $\pi_1$

• if $x$ occurs only in affected positions in $\varphi$ then, for
every occurrence of $x$ in $\psi$ in position $\pi_2$, add an
edge $\pi_1 \rightarrow \pi_2$ (if it does not already exist).

• if $x$ occurs only in affected positions in $\varphi$ then, for
every existentially quantified variable $y$ and for every
occurrence of $y$ in a position $\pi_2$, add a special
edge $\pi_1 \stackrel{*}{\rightarrow} \pi_2$ (if it does not already exist). □

Definition 10. A set $\Sigma$ of constraints is called safe iff
$prop(\Sigma)$ has no cycles going through a special edge. □

The intuition of these definitions is that we forbid an un- 
restricted cascading of null values, i.e. with the help of the 
propagation graph we impose a partial order on the affected 
positions such that any newly introduced null value can only 
be created in a position that has a higher rank in that partial 
order in comparison to null values that may occur in the body 
of a TGD. To state this more precisely, assume a TGD of the
form $\forall \bar{x}(\varphi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x},\bar{y}))$ is violated. Then, $I \models \varphi(\bar{a})$
and $I \not\models \exists \bar{y}\, \psi(\bar{a},\bar{y})$ must hold. The safety condition ensures
that any position in the body that holds a newly created labeled
null from $\bar{a}$ and also occurs in the head of the TGD
has a strictly lower rank in our partial order than any position
in which some element from $\bar{y}$ occurs. The main difference
in comparison to weak acyclicity is that we look in a refined 
way (see affected positions) on where a labeled null can be 
propagated to. We note that given a set of constraints it can 
be decided in polynomial time whether it is safe. 

Example 4. Consider the TGD $R(x_1,x_2,x_3), S(x_2) \rightarrow
\exists y R(x_2,y,x_1)$ from before. The dependency graph is
depicted in Figure 3 on the left side and its propagation
graph on the right side. The only affected position is
$R^2$. From the respective definitions it follows that this
constraint is safe, but not weakly acyclic. □
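Definitions 8 to 10 translate fairly directly into code. The following Python sketch is one possible reading, not a reference implementation: a TGD is modelled as a triple (body atoms, head atoms, set of existential variables), an atom as (relation name, argument tuple), and a position as (relation name, 1-based index). All of these encodings and function names are assumptions for illustration.

```python
def positions_of(var, atoms):
    """All positions (relation, index) at which var occurs in the given atoms."""
    return {(rel, i) for rel, args in atoms
            for i, a in enumerate(args, 1) if a == var}

def affected_positions(tgds):
    """Fixpoint computation of the affected positions (Definition 8)."""
    aff, changed = set(), True
    while changed:
        changed = False
        for body, head, exist in tgds:
            for rel, args in head:
                for i, v in enumerate(args, 1):
                    if (rel, i) in aff:
                        continue
                    body_pos = positions_of(v, body)
                    if v in exist or (body_pos and body_pos <= aff):
                        aff.add((rel, i))
                        changed = True
    return aff

def propagation_graph(tgds):
    """Normal and special edges of the propagation graph (Definition 9)."""
    aff = affected_positions(tgds)
    normal, special = set(), set()
    for body, head, exist in tgds:
        head_vars = {a for _, args in head for a in args}
        for x in head_vars - set(exist):
            body_pos = positions_of(x, body)
            # Only variables occurring solely in affected body positions matter.
            if not body_pos or not body_pos <= aff:
                continue
            for p1 in body_pos:
                normal |= {(p1, p2) for p2 in positions_of(x, head)}
                for y in exist:
                    special |= {(p1, p2) for p2 in positions_of(y, head)}
    return aff, normal, special

def is_safe(tgds):
    """Safe iff no cycle goes through a special edge (Definition 10)."""
    _, normal, special = propagation_graph(tgds)
    succ = {}
    for u, v in normal | special:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ.get(n, ()))
        return False

    # A special edge u -> v lies on a cycle iff u is reachable from v.
    return not any(reachable(v, u) for u, v in special)

# Example 4: R(x1,x2,x3), S(x2) -> exists y R(x2,y,x1)
example4 = [([("R", ("x1", "x2", "x3")), ("S", ("x2",))],
             [("R", ("x2", "y", "x1"))], {"y"})]
# Toy constraint from the beginning of the section: R(x1,x2) -> exists y R(x2,y)
toy = [([("R", ("x1", "x2"))], [("R", ("x2", "y"))], {"y"})]
print(affected_positions(example4), is_safe(example4), is_safe(toy))
```

For Example 4 the sketch reports the single affected position R^2, an empty edge set, and safety. For the toy constraint both positions of R are affected and the special edge from R^2 to itself forms a cycle, so the constraint is not safe, which matches its non-terminating chase.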

Note that if S is safe, then every subset of S is safe, too. 
We will now compare safety to other termination conditions. 
In the example, the propagation graph is a subgraph of the 
dependency graph. This is not a mere coincidence. 
Theorem 7. Let $\Sigma$ be a set of constraints.

• Then, $prop(\Sigma)$ is a subgraph of $dep(\Sigma)$. It holds
that if $\Sigma$ is weakly acyclic, then it is also safe.

• There is some $\Sigma$ that is safe, but not stratified.

• There is some $\Sigma$ that is stratified, but not safe. □

The next result shows that safety guarantees termination 
while retaining polynomial time data complexity. 

Theorem 8. Let $\Sigma$ be a fixed set of safe constraints.
Then, there exists a polynomial $Q \in \mathbb{N}[X]$ such that for
any database instance $I$, the length of every chase sequence
is bounded by $Q(||I||)$, where $||I||$ is the number
of distinct values in $I$. □

Safely Restricted Constraints. In this section we
generalize the method of stratification from [8] to a condition
which we call safe restriction. The chase graph from [8]
will be a special case of our new notion. We then define
the notion of safe restriction and show that the chase always
terminates for constraints obeying it.

Let $\alpha := S(x_2,x_3), R(x_1,x_2,x_3) \rightarrow \exists y R(x_2,y,x_1)$ and
$\beta := R(x_1,x_2,x_3) \rightarrow S(x_1,x_3)$. It can be seen that $\alpha \prec \beta$
and $\beta \prec \alpha$. Further, $\{\alpha, \beta\}$ is not weakly acyclic, so it
follows that $\{\alpha, \beta\}$ is not stratified. Still, the chase will
always terminate: A firing of $\alpha$ may cause a null value to
appear in position $R^2$, but a firing of $\beta$ will never introduce
null values in the head of $\beta$ although $\beta \prec \alpha$ holds. This is
the key observation for the upcoming definitions. First, we
will refine the relation $\prec$ from [8]. This refinement helps us
to detect if during the chase null values might be copied to
the head of some constraint. Let $pos(\Sigma)$ denote the set of
positions that occur in the body of some constraint in $\Sigma$.

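Definition 11. Let $\Sigma$ be a set of constraints and $P \subseteq pos(\Sigma)$.
For all $\alpha, \beta \in \Sigma$, we define $\alpha \prec_P \beta$ iff there are tuples
$\bar{a}, \bar{b}$ and a database instance $I$ s.t.

• $I \not\models \alpha(\bar{a})$,

• $\beta$ is not applicable on $\bar{b}$ and $I$,

• $I_{\bar{a}} \oplus C_\alpha \not\models \beta(\bar{b})$,

• null values in $I$ occur only in positions from $P$, and

• the firing of $\beta$ in the case of bullet three copies some
null value from $I_{\bar{a}} \oplus C_\alpha$ to the head of $\beta$. □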

We next introduce a notion for affected positions relative 
to a constraint and a set of positions. 

Definition 12. For any set of positions $P$ and TGD $\alpha$ let

$aff\text{-}cl(\alpha, P)$ be the set of positions $\pi$ from the head of $\alpha$
such that either

• the variable in $\pi$ occurs in the body of $\alpha$ only in
positions from $P$ or

• $\pi$ contains an existentially quantified variable. □

The latter definition and the refinement of $\prec$ will help us
to define the notion of a restriction system, which is a strict
generalization of the chase graph introduced in [8].



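Definition 13. A restriction system is a pair $(G'(\Sigma), f)$,
where $G'(\Sigma) := (\Sigma, E)$ is a directed graph and
$f : \Sigma \rightarrow 2^{pos(\Sigma)}$ is a function such that

• for all TGDs $\alpha$ and for all $(\alpha, \beta) \in E$:
$aff\text{-}cl(\alpha, f(\alpha)) \cap pos(\{\beta\}) \subseteq f(\beta)$,

• for all EGDs $\alpha$ and for all $(\alpha, \beta) \in E$:
$f(\alpha) \cap pos(\{\beta\}) \subseteq f(\beta)$, and

• for all $\alpha, \beta \in \Sigma$: $\alpha \prec_{f(\alpha)} \beta \Rightarrow (\alpha, \beta) \in E$. □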

We illustrate this definition by an example. It also shows 
that restriction systems always exist. 

Example 5. Let $\Sigma$ be a set of constraints. Then, $(G(\Sigma), f)$,
where $f(\alpha) := pos(\{\alpha\})$ for all $\alpha \in \Sigma$, is a restriction
system for $\Sigma$. □

Based on the novel technical notion of restriction systems 
we can easily define a new class of constraints. 

Definition 14. $\Sigma$ is called safely restricted if and only if
there is a restriction system $(G'(\Sigma), f)$ for $\Sigma$ such that
every strongly connected component in $G'(\Sigma)$ is safe. □

The next theorem shows that safe restriction strictly ex- 
tends the notion of stratification and safety. 

Theorem 9. If $\Sigma$ is stratified or safe, then it is also safely
restricted. There is some $\Sigma$ that is safely restricted but
neither safe nor stratified. □

Definition 14 implies that safely restricted constraints can
be recognized by a $\Sigma_2^P$-algorithm. However, with the help
of a canonical restriction system, we can show that safe
restriction can be decided in coNP (like stratification).

Theorem 10. Given a constraint set $\Sigma$ it can be checked
by a coNP-algorithm whether $\Sigma$ is safely restricted. □

The next theorem is the main contribution of this section. 
It states that the chase will always terminate in polynomial 
time data complexity for safely restricted constraints. 

Theorem 11. Let $\Sigma$ be a fixed set of safely restricted
constraints. Then, there exists a polynomial $Q \in \mathbb{N}[X]$
such that for any database instance $I$, the length of
every chase sequence is bounded by $Q(||I||)$, where $||I||$
is the number of distinct values in $I$. □

To the best of our knowledge safe restriction is the most 
general sufficient termination condition for TGDs and EGDs. 
We finally compare the chase graph to restriction systems. 
The reader might wonder what happens if we substitute weak 
acyclicity with safety in the definition of stratification (in the 
preliminaries). 

Definition 15. We call $\Sigma$ safely stratified iff the constraints
in every cycle of $G(\Sigma)$ are safe. □

We obtain the following result, showing that with the help
of restriction systems, we strictly extended the method of the
chase graph from [8].
Theorem 12. Let $\Sigma$ be a set of constraints.

• If $\Sigma$ is weakly acyclic or safe, then it is safely
stratified.

• If $\Sigma$ is safely stratified, then it is safely restricted.

• There is some set of constraints that is safely
restricted, but not safely stratified. □

Note that we used safety instead of safe stratification in 
the definition of safe restrictedness although safely stratified
constraints are the provably larger class. This is due to the 
fact that safety is easily checkable and would not change the 
class of constraints. The next proposition clarifies this issue. 

Proposition 3. $\Sigma$ is safely restricted iff there is a restriction
system $(G'(\Sigma), f)$ for $\Sigma$ such that every strongly
connected component in $G'(\Sigma)$ is safely stratified. □

In the previous section we proposed two SPARQL transla- 
tion schemes and it is left to explain why we introduced two 
alternative schemes. The next proposition states that the two 
schemes behave differently with respect to safe restriction. 

Proposition 4. Let $\Sigma$ be a non-empty constraint
set over a ternary relation symbol $T$.

• There is some $\Sigma$ that is safely restricted, but $\Sigma' = \emptyset$,
i.e. the second translation scheme is not applicable.

• There is some $\Sigma$ such that $|\Sigma| = |\Sigma'|$ and $\Sigma'$ is safely
restricted, but $\Sigma$ is not. □

Referring back to Lemma 3, this means we might check
both $\Sigma$ and $\Sigma'$ for safe restrictedness, and can guarantee
termination of the chase if at least one of them is safely restricted.

7. Conclusion 

We have discussed several facets of the SPARQL query 
language. Our complexity analysis extends prior investigations
[26] and (a) shows that the combination of And
and Union is the main source of complexity in OPT-free 
SPARQL fragments and (b) clarifies that yet operator Opt 
alone makes SPARQL evaluation PSPACE-complete. We 
also show that, when restricting the nesting depth of Opt- 
expressions, we obtain better complexity bounds. 

The subsequent study of SPARQL Algebra lays the foun- 
dations for transferring established Relational Algebra opti- 
mization techniques into the context of SPARQL. Addition- 
ally, we considered specifics of the SPARQL query language, 
such as rewriting of SPARQL queries involving negation. 
The algebraic optimization approach is complemented by a 
powerful framework for semantic query optimization. We 
argue that a combination of both algebraic and semantic 
optimization will push the limits of existing SPARQL im- 
plementations and leave the study of a schematic rewriting 
approach and good rewriting heuristics as future work. 

Finally, our results on chase termination empower the prac- 
ticability of SPARQL query optimization in the presence of 
constraints and directly carry over to other applications that 
rely on the chase, such as [28, 20, 15, 9].
APPENDIX

A. Proofs of the Complexity Results 

This section contains the complexity proofs of the Evaluation
problem for the fragments studied in Section 3. We
refer the interested reader to [26] for the proof of Theorem 1.
We start with some basics from complexity theory.

A.1 Background from Complexity Theory

Complexity Classes. As usual, we denote by PTime
(or P, for short) the complexity class comprising all problems
that can be decided by a deterministic Turing Machine (TM)
in polynomial time, by NP the set of problems that can be
decided by a non-deterministic TM in polynomial time, and
by PSpace the class of problems that can be decided by a
deterministic TM within polynomial space bounds.

The Polynomial Hierarchy 

Given a complexity class $C$ we denote by co$C$ the set of
decision problems whose complement can be decided by a
TM in class $C$. Given complexity classes $C_1$ and $C_2$, the
class $C_1^{C_2}$ captures all problems that can be decided by a TM
$M_1$ in class $C_1$ enhanced by an oracle TM $M_2$ for solving
problems in class $C_2$. Informally, engine $M_1$ can use $M_2$
to obtain a yes/no-answer for a problem in $C_2$ in a single
step. We refer the interested reader to [2] for a more formal
discussion of oracle machines. Finally, we define the classes
$\Sigma_i^P$ and $\Pi_i^P$ recursively as

$\Sigma_0^P = \Pi_0^P := P$ and $\Sigma_{n+1}^P := NP^{\Sigma_n^P}$, and put
$\Pi_{n+1}^P := coNP^{\Sigma_n^P}$.

The polynomial hierarchy PH [30] is then defined as

$PH = \bigcup_{i \in \mathbb{N}_0} \Sigma_i^P$.

It is folklore that $\Sigma_i^P = co\Pi_i^P$, and that $\Sigma_i^P \subseteq \Pi_{i+1}^P$
and $\Pi_i^P \subseteq \Sigma_{i+1}^P$ holds. Moreover, the following inclusion
hierarchies for $\Sigma_i^P$ and $\Pi_i^P$ are known.

$P = \Sigma_0^P \subseteq NP = \Sigma_1^P \subseteq \Sigma_2^P \subseteq \dots \subseteq PSPACE$, and

$P = \Pi_0^P \subseteq coNP = \Pi_1^P \subseteq \Pi_2^P \subseteq \dots \subseteq PSPACE$.

Complete Problems 

We consider completeness only with respect to polynomial-time
many-one reductions. QBF, the tautology test for quantified
boolean formulas, is known to be PSPACE-complete
[2]. Forms of QBF with restricted quantifier alternation are
complete for classes $\Pi_i^P$ or $\Sigma_i^P$ depending on whether
the first quantifier of the formula is $\forall$ or $\exists$. A more thorough
introduction to complete problems in the polynomial hierarchy
can be found in [2]. Finally, the NP-completeness of
the SetCover problem and the 3SAT problem is folklore.
A.2 OPT-free Fragments (Theorem 2)

Fragment U: UNION (Theorem 2(1))

For a UNION-only expression $P$ and data set $D$ it suffices
to check if $\mu \in \llbracket t \rrbracket_D$ for any triple pattern $t$ in $P$. This can
easily be achieved in polynomial time. □

Fragment FU: FILTER + UNION (Theorem 2(1))

We present a PTime-algorithm that solves the Evaluation
problem for this fragment. It is defined on the structure of
the input expression $P$ and returns true if $\mu \in \llbracket P \rrbracket_D$, false
otherwise. We distinguish three cases. (a) If $P = t$ is
a triple pattern, we return true if and only if $\mu \in \llbracket t \rrbracket_D$.
(b) If $P = P_1\ \text{Union}\ P_2$ we (recursively) check if $\mu \in
\llbracket P_1 \rrbracket_D \vee \mu \in \llbracket P_2 \rrbracket_D$ holds. (c) If $P = P_1\ \text{Filter}\ R$ for
any filter condition $R$ we return true if and only if $\mu \in
\llbracket P_1 \rrbracket_D \wedge R \models \mu$. It is easy to see that the above algorithm
runs in polynomial time. Its correctness follows from the
definition of the algebraic operators $\cup$ and $\sigma$. □
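The three cases of the algorithm map onto a short recursive function. The Python sketch below is an illustration under stated assumptions: expressions are nested tuples tagged `"triple"`, `"union"`, or `"filter"`, a mapping is a dict from variables (terms starting with `?`) to values, and a filter condition is modelled as a Python predicate over the mapping, which ignores SPARQL's error semantics for filters. None of these encodings come from the paper.

```python
def matches(pattern, mu, triples):
    """Check mu in [[t]]_D: mu binds exactly the variables of t and mu(t) is in D."""
    variables = {term for term in pattern if term.startswith("?")}
    if set(mu) != variables:
        return False
    return tuple(mu.get(term, term) for term in pattern) in triples

def holds(expr, mu, triples):
    """Decide mu in [[P]]_D for expressions built from Filter and Union."""
    kind = expr[0]
    if kind == "triple":   # case (a)
        return matches(expr[1], mu, triples)
    if kind == "union":    # case (b)
        return holds(expr[1], mu, triples) or holds(expr[2], mu, triples)
    if kind == "filter":   # case (c)
        return holds(expr[1], mu, triples) and expr[2](mu)
    raise ValueError(f"unsupported operator: {kind}")

# Hypothetical data set and query.
D = {("a", "knows", "b"), ("a", "age", "30")}
P = ("union",
     ("filter", ("triple", ("?x", "knows", "?y")), lambda mu: mu["?y"] == "b"),
     ("triple", ("?x", "age", "?z")))
print(holds(P, {"?x": "a", "?y": "b"}, D))  # True
```

Each subexpression is visited at most once and every visit costs at most one scan over the data, assuming the filter predicate itself can be evaluated in polynomial time, so the running time is polynomial in the size of the expression and the data set.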

Fragment AU: AND + UNION (Theorem 2(2))

In order to show that Evaluation for this fragment is
NP-complete we have to show membership and hardness.

Membership in NP. Let $P$ be a SPARQL expression
composed of operators And and Union, $D$ a document,
and $\mu$ a mapping. We provide an NP-algorithm that returns
true if $\mu \in \llbracket P \rrbracket_D$, and false otherwise. Our algorithm is
defined on the structure of $P$. (a) If $P = t$ return true if
$\mu \in \llbracket t \rrbracket_D$, false otherwise. (b) If $P = P_1\ \text{Union}\ P_2$, we
return the truth value of $\mu \in \llbracket P_1 \rrbracket_D \vee \mu \in \llbracket P_2 \rrbracket_D$. (c) If
$P = P_1\ \text{And}\ P_2$, we guess a decomposition $\mu = \mu_1 \cup \mu_2$
and return the truth value of $\mu_1 \in \llbracket P_1 \rrbracket_D \wedge \mu_2 \in \llbracket P_2 \rrbracket_D$.
Correctness of the algorithm follows from the definition of
the algebraic operators $\Join$ and $\cup$. It can easily be realized by
a non-deterministic TM that runs in polynomial time, which
proves membership in NP.

NP-Hardness. We reduce the SetCover problem to the
Evaluation problem for SPARQL (in polynomial time).
SetCover is known to be NP-complete, so the reduction
gives us the desired hardness result.

The decision version of SetCover is defined as follows.
Let $U = \{u_1, \ldots, u_p\}$ be a universe, let $S_1, \ldots, S_n \subseteq U$ be
sets over $U$, and let $k$ be a positive integer. Is there a set
$I \subseteq \{1, \ldots, n\}$ of size $|I| \le k$ s.t. $\bigcup_{i \in I} S_i = U$?

We use the fixed database $D := \{(a, b, 1)\}$ for our encoding
and represent each set $S_i = \{x_1, x_2, \ldots, x_m\}$ by a
SPARQL expression of the form

$$P_{S_i} := (a, b, ?X_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, b, ?X_m).$$

The set $S = \{S_1, \ldots, S_n\}$ of all $S_i$ is then encoded as

$$P_S := P_{S_1} \ \textsc{Union} \ \ldots \ \textsc{Union} \ P_{S_n}.$$

Finally we define the SPARQL expression

$$P := P_S \ \textsc{And} \ \ldots \ \textsc{And} \ P_S,$$

where $P_S$ appears exactly $k$ times.

It is straightforward to show that SetCover is true if
and only if $\mu = \{?U_1 \mapsto 1, \ldots, ?U_p \mapsto 1\} \in \llbracket P \rrbracket_D$. The
intuition of the encoding is as follows. $P_S$ encodes all subsets
$S_i$. A set element, say $x$, is represented in SPARQL by a
mapping from variable $?X$ to value 1. The encoding of $P$
allows us to merge (at most) $k$ arbitrary sets $S_i$. We finally
check if the universe $U$ can be constructed this way. □
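As a sanity check of this reduction, the following Python sketch builds the result of $P$ over the fixed database and tests whether the all-ones mapping over the universe is contained in it. It is an illustration under stated assumptions: mappings are frozensets of (variable, value) pairs, elements are named by the hypothetical variables `?X<element>`, every $S_i$ is non-empty, and the result is materialized in full, which takes exponential time in general and is only meant for tiny inputs.

```python
from functools import reduce

D = {("a", "b", 1)}  # the fixed database of the reduction

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == x for v, x in d1.items() if v in d2)

def join(A, B):
    # Algebraic join: union of every compatible pair of mappings.
    return {m1 | m2 for m1 in A for m2 in B if compatible(m1, m2)}

def triple(var):
    # [[(a, b, ?var)]]_D: one mapping per matching triple in D.
    return {frozenset({(var, o)}) for (s, p, o) in D if (s, p) == ("a", "b")}

def evaluate_encoding(sets, k):
    # P_{S_i}: And over one triple pattern per element of S_i.
    p_si = [reduce(join, [triple(f"?X{x}") for x in sorted(s)]) for s in sets]
    p_s = set().union(*p_si)          # P_S: Union of all P_{S_i}
    return reduce(join, [p_s] * k)    # P: k-fold And of P_S

def has_cover(universe, sets, k):
    mu = frozenset((f"?X{u}", 1) for u in universe)
    return mu in evaluate_encoding(sets, k)
```

Because the same $P_{S_i}$ may be picked in several of the $k$ copies of $P_S$, the encoding captures covers of size at most $k$, not exactly $k$.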

Remark 2. The proof above relies on the fact that mapping
$\mu$ is part of the input of the Evaluation problem.
In fact, when fixing $\mu$ the resulting (modified) version
of the Evaluation problem can be solved in PTime. □

A.3 Fragments Including Operator OPT

We now discuss several fragments including Opt. One
goal here is to show that fragment $\mathcal{O}$ is PSPACE-complete
(cf. Theorem 3); PSPACE-completeness for all fragments
involving Opt then follows (cf. Corollary 1). Given the
PSPACE-completeness results for fragment $\mathcal{E} = \mathcal{AFOU}$, it
suffices to prove hardness for all smaller fragments; membership
is implicit. Our road map is as follows.

1. We first show PSPACE-hardness for fragment $\mathcal{AFO}$.

2. We then show PSPACE-hardness for fragment $\mathcal{AO}$.

3. Next, a rewriting of operator And by Opt is presented,
which can be used to eliminate all And operators in
the proof of (2). PSPACE-completeness for $\mathcal{O}$ then is
shown using this rewriting rule.

4. Finally, we prove Theorem 4, i.e. show that fragment
$\mathcal{E}_{\le n}$ is $\Sigma^P_{n+1}$-complete, making use of part (1).

Fragment $\mathcal{AFO}$: AND + FILTER + OPT

We present a (polynomial-time) reduction from QBF to
Evaluation for fragment $\mathcal{AFO}$. The QBF problem is
known to be PSPACE-complete, so this reduction gives us the
desired PSPACE-hardness result. Membership in PSPACE,
and hence PSPACE-completeness, then follows from
Theorem 1(3). QBF is defined as follows.

QBF: given a quantified boolean formula of the form

$$\varphi = \forall x_1 \exists y_1 \forall x_2 \exists y_2 \ldots \forall x_m \exists y_m \, \psi,$$

where $\psi$ is a quantifier-free boolean formula,
as input: is $\varphi$ valid?

The following proof was inspired by the proof of
Theorem 3 in [26]: we encode the inner formula $\psi$ using And
and Filter, and then adopt the translation scheme for the
quantifier sequence $\forall \exists \forall \exists \ldots$ proposed in [26].
First note that, according to the problem statement, $\psi$ is a
quantifier-free boolean formula. We assume w.l.o.g. that $\psi$
is composed of $\land$, $\lor$ and $\neg$. We use the fixed database

$$D := \{(a, tv, 0), (a, tv, 1), (a, false, 0), (a, true, 1)\}$$

and denote by $V = \{v_1, \ldots, v_l\}$ the set of variables appearing
in $\psi$. Formula $\psi$ then is encoded as

$$P_\psi := ((a, tv, ?V_1) \ \textsc{And} \ (a, tv, ?V_2) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, tv, ?V_l)) \ \textsc{Filter} \ f(\psi),$$

where $f(\psi)$ is a function that generates a SPARQL condition
that mirrors the boolean formula $\psi$. More precisely, $f$ is
defined recursively on the structure of $\psi$ as

$$f(v_i) := \; ?V_i = 1$$
$$f(\psi_1 \land \psi_2) := f(\psi_1) \land f(\psi_2)$$
$$f(\psi_1 \lor \psi_2) := f(\psi_1) \lor f(\psi_2)$$
$$f(\neg \psi_1) := \neg f(\psi_1)$$

In our encoding $P_\psi$, the And-block generates all possible
valuations for the variables, while the Filter-expression
retains exactly those valuations that satisfy formula $\psi$. It is
straightforward to show that $\psi$ is satisfiable if and only if
there exists a mapping $\mu \in \llbracket P_\psi \rrbracket_D$ and, moreover, for each
mapping $\mu \in \llbracket P_\psi \rrbracket_D$ there is a truth assignment $\rho_\mu$ defined
as $\rho_\mu(x) = \mu(?X)$ for all variables $?X_i, ?Y_i \in dom(\mu)$
such that $\mu \in \llbracket P_\psi \rrbracket_D$ if and only if $\rho_\mu$ satisfies $\psi$. Given
$P_\psi$, we can encode the quantifier sequence using a series of
nested Opt statements as shown in [26]. To make the proof
self-contained, we briefly summarize this construction.

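SPARQL variables $?X_1, \ldots, ?X_m$ and $?Y_1, \ldots, ?Y_m$ are
used to represent variables $x_1, \ldots, x_m$ and $y_1, \ldots, y_m$,
respectively. In addition to these variables, we use fresh variables
$?A_0, \ldots, ?A_m$, $?B_0, \ldots, ?B_m$ and operators And and
Opt to encode the quantifier sequence $\forall x_1 \exists y_1 \ldots \forall x_m \exists y_m$.
For each $i \in [m]$ we define two expressions $P_i$ and $Q_i$

$$P_i := ((a, tv, ?X_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, tv, ?X_i) \ \textsc{And} \ (a, tv, ?Y_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, tv, ?Y_{i-1}) \ \textsc{And} \ (a, false, ?A_{i-1}) \ \textsc{And} \ (a, true, ?A_i)),$$

$$Q_i := ((a, tv, ?X_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, tv, ?X_i) \ \textsc{And} \ (a, tv, ?Y_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (a, tv, ?Y_i) \ \textsc{And} \ (a, false, ?B_{i-1}) \ \textsc{And} \ (a, true, ?B_i)),$$

and encode $\varphi$ as

$$P_\varphi := ((a, true, ?B_0) \ \textsc{Opt} \ (P_1 \ \textsc{Opt} \ (Q_1 \ \textsc{Opt} \ (P_2 \ \textsc{Opt} \ (Q_2 \ \ldots \ \textsc{Opt} \ (P_m \ \textsc{Opt} \ (Q_m \ \textsc{And} \ P_\psi)) \ldots )))))$$

Footnote: In [26] $\psi$ was additionally restricted to be in CNF. We relax
this restriction here.

It can be shown that $\mu = \{?B_0 \mapsto 1\} \in \llbracket P_\varphi \rrbracket_D$ iff $\varphi$ is
valid, which completes the reduction. We refer the reader to
the proof of Theorem 3 in [26] for this part of the proof. □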

Remark 3. The proof for this fragment ($\mathcal{AFO}$) is subsumed
by the subsequent proof, which shows PSPACE-hardness
for a smaller fragment. It was included to
illustrate how to encode quantifier-free boolean formulas
that are not in CNF. Some of the following proofs
build upon this construction. □

Fragment $\mathcal{AO}$: AND + OPT

We reduce the QBF problem to Evaluation for class $\mathcal{AO}$.
We encode a quantified boolean formula of the form

$$\varphi = \forall x_1 \exists y_1 \forall x_2 \exists y_2 \ldots \forall x_m \exists y_m \, \psi,$$

where $\psi$ is a quantifier-free formula in conjunctive normal
form (CNF), i.e. $\psi$ is a conjunction of clauses

$$\psi = C_1 \land \cdots \land C_n,$$

where the $C_i$, $1 \le i \le n$, are disjunctions of literals.

Footnote: In the previous proof (for fragment $\mathcal{AFO}$) there was no
such restriction for formula $\psi$. Still, it is known that QBF is
also PSPACE-complete when restricting to formulae in CNF.

By $V$ we denote the variables in $\psi$ and by $V_{C_i}$ the variables
appearing in clause $C_i$ (either as positive or negative literals).
We use the following database, which is polynomial in the
size of the query.

$$D := \{(a, tv, 0), (a, tv, 1), (a, false, 0), (a, true, 1)\} \cup \{(a, var_i, v) \mid v \in V_{C_i}\} \cup \{(a, v, v) \mid v \in V\}$$

For each $C_i = v_1 \lor \cdots \lor v_j \lor \neg v_{j+1} \lor \cdots \lor \neg v_k$, where the
$v_1 \ldots v_j$ are positive and the $v_{j+1} \ldots v_k$ are negated variables
(contained in $V_{C_i}$), we define a separate SPARQL expression

$$P_{C_i} := ((\ldots((\ldots($$
$$\quad (a, var_i, ?var_i)$$
$$\quad \textsc{Opt} \ ((a, v_1, ?var_i) \ \textsc{And} \ (a, true, ?V_1)))$$
$$\quad \ldots$$
$$\quad \textsc{Opt} \ ((a, v_j, ?var_i) \ \textsc{And} \ (a, true, ?V_j)))$$
$$\quad \textsc{Opt} \ ((a, v_{j+1}, ?var_i) \ \textsc{And} \ (a, false, ?V_{j+1})))$$
$$\quad \ldots$$
$$\quad \textsc{Opt} \ ((a, v_k, ?var_i) \ \textsc{And} \ (a, false, ?V_k)))$$

and encode formula $\psi$ as

$$P_\psi := P_{C_1} \ \textsc{And} \ \ldots \ \textsc{And} \ P_{C_n}.$$

It is straightforward to verify that $\psi$ is satisfiable if and
only if there is a mapping $\mu \in \llbracket P_\psi \rrbracket_D$ and, moreover, for
each $\mu \in \llbracket P_\psi \rrbracket_D$ there is a truth assignment $\rho_\mu$ defined as
$\rho_\mu(x) = \mu(?X)$ for all variables $?X_i, ?Y_i \in dom(\mu)$ such
that $\mu \in \llbracket P_\psi \rrbracket_D$ if and only if $\rho_\mu$ satisfies $\psi$. Now, given
$P_\psi$, we encode the quantifier sequence using only operators
Opt and And, as shown in the previous proof for fragment
$\mathcal{AFO}$. For the resulting encoding $P_\varphi$, it analogously holds
that $\mu = \{?B_0 \mapsto 1\} \in \llbracket P_\varphi \rrbracket_D$ iff $\varphi$ is valid. □

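We provide a small example that illustrates the translation
scheme for QBF presented in the proof above.

Example 6. We show how to encode the QBF

$$\varphi = \forall x_1 \exists y_1 (x_1 \Leftrightarrow y_1) = \forall x_1 \exists y_1 ((x_1 \lor \neg y_1) \land (\neg x_1 \lor y_1)),$$

where $\psi = ((x_1 \lor \neg y_1) \land (\neg x_1 \lor y_1))$ is in CNF. It
is easy to see that the QBF formula is a tautology.
The variables in $\psi$ are $V = \{x_1, y_1\}$; further, we have
$C_1 = x_1 \lor \neg y_1$, $C_2 = \neg x_1 \lor y_1$, and $V_{C_1} = V_{C_2} = V = \{x_1, y_1\}$.
Following the construction in the proof we set
up the database

$$D := \{(a, tv, 0), (a, tv, 1), (a, false, 0), (a, true, 1), (a, var_1, x_1), (a, var_1, y_1), (a, var_2, x_1), (a, var_2, y_1), (a, x_1, x_1), (a, y_1, y_1)\}$$

and define expression $P_\psi = P_{C_1} \ \textsc{And} \ P_{C_2}$, where

$$P_{C_1} := ((a, var_1, ?var_1) \ \textsc{Opt} \ ((a, x_1, ?var_1) \ \textsc{And} \ (a, true, ?X_1))) \ \textsc{Opt} \ ((a, y_1, ?var_1) \ \textsc{And} \ (a, false, ?Y_1))$$
$$P_{C_2} := ((a, var_2, ?var_2) \ \textsc{Opt} \ ((a, y_1, ?var_2) \ \textsc{And} \ (a, true, ?Y_1))) \ \textsc{Opt} \ ((a, x_1, ?var_2) \ \textsc{And} \ (a, false, ?X_1)).$$

When evaluating these expressions we get:

$$\llbracket P_{C_1} \rrbracket_D = (\{\{?var_1 \mapsto x_1\}, \{?var_1 \mapsto y_1\}\} ⟕ \{\{?var_1 \mapsto x_1, ?X_1 \mapsto 1\}\}) ⟕ \{\{?var_1 \mapsto y_1, ?Y_1 \mapsto 0\}\}$$
$$= \{\{?var_1 \mapsto x_1, ?X_1 \mapsto 1\}, \{?var_1 \mapsto y_1, ?Y_1 \mapsto 0\}\}$$

$$\llbracket P_{C_2} \rrbracket_D = (\{\{?var_2 \mapsto x_1\}, \{?var_2 \mapsto y_1\}\} ⟕ \{\{?var_2 \mapsto y_1, ?Y_1 \mapsto 1\}\}) ⟕ \{\{?var_2 \mapsto x_1, ?X_1 \mapsto 0\}\}$$
$$= \{\{?var_2 \mapsto x_1, ?X_1 \mapsto 0\}, \{?var_2 \mapsto y_1, ?Y_1 \mapsto 1\}\}$$

$$\llbracket P_\psi \rrbracket_D = \llbracket P_{C_1} \ \textsc{And} \ P_{C_2} \rrbracket_D$$
$$= \{\{?var_1 \mapsto x_1, ?var_2 \mapsto y_1, ?X_1 \mapsto 1, ?Y_1 \mapsto 1\}, \{?var_1 \mapsto y_1, ?var_2 \mapsto x_1, ?X_1 \mapsto 0, ?Y_1 \mapsto 0\}\}$$

Finally, we set up the expressions $P_1$ and $Q_1$, as described
in the proof for fragment $\mathcal{AFO}$

$$P_1 := ((a, tv, ?X_1) \ \textsc{And} \ (a, false, ?A_0) \ \textsc{And} \ (a, true, ?A_1))$$
$$Q_1 := ((a, tv, ?X_1) \ \textsc{And} \ (a, tv, ?Y_1) \ \textsc{And} \ (a, false, ?B_0) \ \textsc{And} \ (a, true, ?B_1))$$

and encode the quantified boolean formula $\varphi$ as

$$P_\varphi := (a, true, ?B_0) \ \textsc{Opt} \ (P_1 \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ P_\psi))$$

We leave it as an exercise to verify that the mapping
$\mu = \{?B_0 \mapsto 1\}$ is contained in $\llbracket P_\varphi \rrbracket_D$. This result
confirms that the original formula $\varphi$ is valid. □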
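The exercise can also be checked mechanically. The sketch below is a small evaluator for And/Opt expressions over the database of Example 6, written under assumptions of our own: expressions are nested tuples tagged "AND" or "OPT", triple patterns are plain 3-tuples of strings, and mappings are frozensets of (variable, value) pairs. The helper names `ev`, `AND`, and `OPT` are hypothetical.

```python
from functools import reduce

D = {("a", "tv", 0), ("a", "tv", 1), ("a", "false", 0), ("a", "true", 1),
     ("a", "var1", "x1"), ("a", "var1", "y1"),
     ("a", "var2", "x1"), ("a", "var2", "y1"),
     ("a", "x1", "x1"), ("a", "y1", "y1")}

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == x for v, x in d1.items() if v in d2)

def join(A, B):
    return {m1 | m2 for m1 in A for m2 in B if compatible(m1, m2)}

def left_join(A, B):
    # Join plus every mapping of A that has no compatible partner in B.
    return join(A, B) | {m for m in A if not any(compatible(m, b) for b in B)}

def ev(P):
    if P[0] == "AND":
        return join(ev(P[1]), ev(P[2]))
    if P[0] == "OPT":
        return left_join(ev(P[1]), ev(P[2]))
    result = set()                       # P is a triple pattern
    for triple in D:
        m = {}
        if all(m.setdefault(p, v) == v if p.startswith("?") else p == v
               for p, v in zip(P, triple)):
            result.add(frozenset(m.items()))
    return result

def AND(*ps):
    return reduce(lambda l, r: ("AND", l, r), ps)

def OPT(l, r):
    return ("OPT", l, r)

P_C1 = OPT(OPT(("a", "var1", "?var1"),
               AND(("a", "x1", "?var1"), ("a", "true", "?X1"))),
           AND(("a", "y1", "?var1"), ("a", "false", "?Y1")))
P_C2 = OPT(OPT(("a", "var2", "?var2"),
               AND(("a", "y1", "?var2"), ("a", "true", "?Y1"))),
           AND(("a", "x1", "?var2"), ("a", "false", "?X1")))
P_psi = AND(P_C1, P_C2)
P_1 = AND(("a", "tv", "?X1"), ("a", "false", "?A0"), ("a", "true", "?A1"))
Q_1 = AND(("a", "tv", "?X1"), ("a", "tv", "?Y1"),
          ("a", "false", "?B0"), ("a", "true", "?B1"))
P_phi = OPT(("a", "true", "?B0"), OPT(P_1, AND(Q_1, P_psi)))

result = ev(P_phi)
```

Both valuations of $?X_1$ produced by $P_1$ find a compatible extension in $Q_1$ And $P_\psi$, so every mapping of the inner Opt binds $?B_0$ to 0 and the outer mapping $\{?B_0 \mapsto 1\}$ survives unextended.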



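Fragment $\mathcal{O}$: OPT-only (Theorem 3)

We start with a transformation rule for operator And; it
essentially expresses the key idea of the subsequent proof.

Lemma 7. Let

• $Q, Q_1, Q_2, \ldots, Q_n$ ($n \ge 2$) be SPARQL expressions,

• $S = vars(Q) \cup vars(Q_1) \cup vars(Q_2) \cup \cdots \cup vars(Q_n)$
denote the set of variables in $Q, Q_1, Q_2, \ldots, Q_n$,

• $D = \{(a, true, 1), (a, false, 0), (a, tv, 0), (a, tv, 1)\}$
be a fixed database,

• $?V_2, ?V_3, \ldots, ?V_n$ be a set of $n - 1$ fresh variables,
i.e. $S \cap \{?V_2, ?V_3, \ldots, ?V_n\} = \emptyset$ holds.

Further, we define

$$Q' := ((\ldots((Q \ \textsc{Opt} \ V_2) \ \textsc{Opt} \ V_3) \ldots) \ \textsc{Opt} \ V_n),$$
$$Q'' := ((\ldots((Q_1 \ \textsc{Opt} \ (Q_2 \ \textsc{Opt} \ V_2)) \ \textsc{Opt} \ (Q_3 \ \textsc{Opt} \ V_3)) \ldots) \ \textsc{Opt} \ (Q_n \ \textsc{Opt} \ V_n)),$$
$$V_i := (a, true, ?V_i), \text{ and}$$
$$\overline{V}_i := (a, false, ?V_i).$$

The following claims hold.

(1) $\llbracket Q' \rrbracket_D = \{\mu \cup \{?V_2 \mapsto 1, \ldots, ?V_n \mapsto 1\} \mid \mu \in \llbracket Q \rrbracket_D\}$,

(2) $\llbracket Q' \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ Q_2 \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n) \rrbracket_D = \llbracket Q' \ \textsc{Opt} \ ((\ldots((Q'' \ \textsc{Opt} \ \overline{V}_2) \ \textsc{Opt} \ \overline{V}_3) \ldots) \ \textsc{Opt} \ \overline{V}_n) \rrbracket_D$ □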

The second part of the lemma provides a way to rewrite an 
AND-only expression that is encapsulated in the right side 
of an OPT-expression by means of an OPT-only expression. 
Before proving the lemma, we illustrate the construction by 
means of a small example. 

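Example 7. Let $D$ be the database given in the previous
lemma and consider the expressions

$Q := (a, tv, ?a)$, i.e. $\llbracket Q \rrbracket_D = \{\{?a \mapsto 0\}, \{?a \mapsto 1\}\}$
$Q_1 := (a, true, ?a)$, i.e. $\llbracket Q_1 \rrbracket_D = \{\{?a \mapsto 1\}\}$
$Q_2 := (a, false, ?b)$, i.e. $\llbracket Q_2 \rrbracket_D = \{\{?b \mapsto 0\}\}$

As for part (1) of the lemma we observe that

$$\llbracket Q' \rrbracket_D = \llbracket Q \ \textsc{Opt} \ V_2 \rrbracket_D = \llbracket Q \ \textsc{Opt} \ (a, true, ?V_2) \rrbracket_D = \{\{?a \mapsto 0, ?V_2 \mapsto 1\}, \{?a \mapsto 1, ?V_2 \mapsto 1\}\}.$$

Concerning part (2) it holds that the left side

$$\llbracket Q' \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ Q_2) \rrbracket_D = \llbracket Q' \rrbracket_D ⟕ \{\{?a \mapsto 1, ?b \mapsto 0\}\} = \{\{?a \mapsto 0, ?V_2 \mapsto 1\}, \{?a \mapsto 1, ?b \mapsto 0, ?V_2 \mapsto 1\}\}$$

is equal to the right side

$$\llbracket Q' \ \textsc{Opt} \ ((Q_1 \ \textsc{Opt} \ (Q_2 \ \textsc{Opt} \ V_2)) \ \textsc{Opt} \ \overline{V}_2) \rrbracket_D$$
$$= \llbracket Q' \rrbracket_D ⟕ (\{\{?a \mapsto 1, ?b \mapsto 0, ?V_2 \mapsto 1\}\} ⟕ \llbracket \overline{V}_2 \rrbracket_D)$$
$$= \llbracket Q' \rrbracket_D ⟕ \{\{?a \mapsto 1, ?b \mapsto 0, ?V_2 \mapsto 1\}\}$$
$$= \{\{?a \mapsto 0, ?V_2 \mapsto 1\}, \{?a \mapsto 1, ?b \mapsto 0, ?V_2 \mapsto 1\}\}. \ □$$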
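The two sides of claim (2) in Example 7 can be compared mechanically. The following sketch is an assumption-laden illustration rather than part of the proof: expressions are nested tuples tagged "AND" or "OPT", triple patterns are 3-tuples of strings, and mappings are frozensets of (variable, value) pairs; the names `ev`, `lhs`, and `rhs` are hypothetical.

```python
D = {("a", "true", 1), ("a", "false", 0), ("a", "tv", 0), ("a", "tv", 1)}

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == x for v, x in d1.items() if v in d2)

def join(A, B):
    return {m1 | m2 for m1 in A for m2 in B if compatible(m1, m2)}

def left_join(A, B):
    return join(A, B) | {m for m in A if not any(compatible(m, b) for b in B)}

def ev(P):
    if P[0] == "AND":
        return join(ev(P[1]), ev(P[2]))
    if P[0] == "OPT":
        return left_join(ev(P[1]), ev(P[2]))
    result = set()                       # P is a triple pattern
    for triple in D:
        m = {}
        if all(m.setdefault(p, v) == v if p.startswith("?") else p == v
               for p, v in zip(P, triple)):
            result.add(frozenset(m.items()))
    return result

Q, Q1, Q2 = ("a", "tv", "?a"), ("a", "true", "?a"), ("a", "false", "?b")
V2, V2_bar = ("a", "true", "?V2"), ("a", "false", "?V2")

Q_prime = ("OPT", Q, V2)                          # Q'
Q_double = ("OPT", Q1, ("OPT", Q2, V2))           # Q''

lhs = ev(("OPT", Q_prime, ("AND", Q1, Q2)))       # left side of claim (2)
rhs = ev(("OPT", Q_prime, ("OPT", Q_double, V2_bar)))  # Opt-only right side
```

Both `lhs` and `rhs` should come out as the two-element set computed by hand in the example.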
Proof of Lemma 7. We omit some technical details,
but instead give the intuition of the encoding. (1) The first
claim follows trivially from the definition of $Q'$, the observation
that each $V_i$ evaluates to $\{\{?V_i \mapsto 1\}\}$, and the
fact that all $?V_i$ are unbound in $Q$ (recall that, by assumption,
the $?V_i$ are fresh variables). To prove (2), we consider
the evaluation of the right side expression, in order to
show that it yields the same result as the left side. First
consider subexpression $Q''$ and observe that the result of
evaluating $Q_i \ \textsc{Opt} \ V_i$ is exactly the result of evaluating $Q_i$
extended by the binding $?V_i \mapsto 1$. In the sequel, we use
$Q_i^V$ as an abbreviation for $Q_i \ \textsc{Opt} \ V_i$, i.e. we denote $Q''$ as
$((\ldots((Q_1 \ \textsc{Opt} \ Q_2^V) \ \textsc{Opt} \ Q_3^V) \ \textsc{Opt} \ \ldots) \ \textsc{Opt} \ Q_n^V)$.
Applying semantics, we can rewrite $\llbracket Q'' \rrbracket_D$ into the form

$$\llbracket Q'' \rrbracket_D = \llbracket ((\ldots((Q_1 \ \textsc{Opt} \ Q_2^V) \ \textsc{Opt} \ Q_3^V) \ \textsc{Opt} \ \ldots) \ \textsc{Opt} \ Q_n^V) \rrbracket_D$$
$$= \llbracket (Q_1 \ \textsc{And} \ Q_2^V \ \textsc{And} \ Q_3^V \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n^V) \rrbracket_D \cup P_D,$$

where we call the left subexpression of the union the join
part, and $P_D$ at the right side is an algebra expression (over
database $D$) with the following property: for each mapping
$\mu \in P_D$ there is at least one $?V_i$ ($2 \le i \le n$) s.t. $?V_i \notin dom(\mu)$.
We observe that, in contrast, for each mapping $\mu$
in the join part $dom(\mu) \supseteq \{?V_2, \ldots, ?V_n\}$ holds and, even
more, $\mu(?V_i) = 1$ for $2 \le i \le n$. Hence, the mappings in
the result of the join part are identified by the property that
$?V_2, ?V_3, \ldots, ?V_n$ are all bound to 1.

Let us next consider the evaluation of the larger expression
(on the right side of the original equation)

$$R := ((\ldots((Q'' \ \textsc{Opt} \ \overline{V}_2) \ \textsc{Opt} \ \overline{V}_3) \ \textsc{Opt} \ \ldots) \ \textsc{Opt} \ \overline{V}_n).$$

When evaluating $R$, we obtain exactly the mappings from
$\llbracket Q'' \rrbracket_D$, but each mapping $\mu \in \llbracket Q'' \rrbracket_D$ is extended by bindings
$?V_i \mapsto 0$ for all $?V_i \notin dom(\mu)$ (cf. the argumentation
for claim (1)). As argued before, all mappings in the join
part of $Q''$ are complete in the sense that all $?V_i$ are bound,
so these mappings will not be affected. The remaining mappings
(i.e. those originating from $P_D$) will be extended by
bindings $?V_i \mapsto 0$ for at least one $?V_i$. The resulting situation
can be summarized as follows: all mappings resulting from
the join part of $Q''$ bind all variables $?V_i$ to 1; all mappings
in $P_D$ bind all $?V_i$, but at least one of them is bound to 0.

From part (1) we know that each mapping in $\llbracket Q' \rrbracket_D$ maps
all $?V_i$ to 1. Hence, when computing
$\llbracket Q' \ \textsc{Opt} \ R \rrbracket_D = \llbracket Q' \rrbracket_D ⟕ \llbracket R \rrbracket_D$, the bindings $?V_i \mapsto 1$ for all $\mu \in \llbracket Q' \rrbracket_D$
serve as a filter that removes the mappings in $\llbracket R \rrbracket_D$ originating
from $P_D$. This means

$$\llbracket Q' \ \textsc{Opt} \ R \rrbracket_D$$
$$= \llbracket Q' \rrbracket_D ⟕ \llbracket R \rrbracket_D$$
$$= \llbracket Q' \rrbracket_D ⟕ \llbracket (Q_1 \ \textsc{And} \ Q_2^V \ \textsc{And} \ Q_3^V \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n^V) \rrbracket_D$$
$$= \llbracket Q' \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ Q_2^V \ \textsc{And} \ Q_3^V \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n^V) \rrbracket_D.$$

Even more, we observe that all $?V_i$ are already bound in
$Q'$ (all of them to 1), so the following rewriting is valid.

$$\llbracket Q' \ \textsc{Opt} \ R \rrbracket_D$$
$$= \llbracket Q' \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ Q_2^V \ \textsc{And} \ Q_3^V \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n^V) \rrbracket_D$$
$$= \llbracket Q' \ \textsc{Opt} \ (Q_1 \ \textsc{And} \ Q_2 \ \textsc{And} \ Q_3 \ \textsc{And} \ \ldots \ \textsc{And} \ Q_n) \rrbracket_D$$

Thus, we have shown that the equivalence holds. This
completes the proof. □

Given Lemma 7 we are now in the position to prove
PSPACE-completeness for fragment $\mathcal{O}$. As in previous
proofs it suffices to show hardness; membership follows as
before from the PSPACE-completeness of fragment $\mathcal{E}$.

The proof idea is the following. We show that, in the
previous reduction from QBF to Evaluation for fragment
$\mathcal{AO}$, each And expression can be rewritten using only Opt
operators. We start with a QBF of the form

$$\varphi = \forall x_1 \exists y_1 \forall x_2 \exists y_2 \ldots \forall x_m \exists y_m \, \psi,$$

where $\psi$ is a quantifier-free formula in conjunctive normal
form (CNF), i.e. $\psi$ is a conjunction of clauses

$$\psi = C_1 \land \cdots \land C_n,$$

where the $C_i$, $1 \le i \le n$, are disjunctions of literals.
By $V$ we denote the set of variables inside $\psi$ and by $V_{C_i}$
the variables appearing in clause $C_i$ (either in positive or
negative form) and use the same database as in the proof for
fragment $\mathcal{AO}$, namely

$$D := \{(a, tv, 0), (a, tv, 1), (a, false, 0), (a, true, 1)\} \cup \{(a, var_i, v) \mid v \in V_{C_i}\} \cup \{(a, v, v) \mid v \in V\}$$

The first modification of the proof for class $\mathcal{AO}$ concerns
the encoding of clauses $C_i = v_1 \lor \cdots \lor v_j \lor \neg v_{j+1} \lor \cdots \lor \neg v_k$,
where the $v_1 \ldots v_j$ are positive and $v_{j+1} \ldots v_k$ are negated
variables. In the prior encoding we used both And and Opt
operators to encode them. It is easy to see that we can simply
replace each And operator there by Opt without
changing semantics. The reason is that, for all subexpressions
$A \ \textsc{Opt} \ B$ in the encoding, it holds that
$vars(A) \cap vars(B) = \emptyset$ and $\llbracket B \rrbracket_D \neq \emptyset$; hence, all mappings in $A$ are
compatible with all mappings in $B$ and there is at least one
mapping in $B$. When applying this modification, we obtain
the following $\mathcal{O}$-encoding for clauses $C_i$.

$$P_{C_i} := ((\ldots((\ldots($$
$$\quad (a, var_i, ?var_i)$$
$$\quad \textsc{Opt} \ ((a, v_1, ?var_i) \ \textsc{Opt} \ (a, true, ?V_1)))$$
$$\quad \ldots$$
$$\quad \textsc{Opt} \ ((a, v_j, ?var_i) \ \textsc{Opt} \ (a, true, ?V_j)))$$
$$\quad \textsc{Opt} \ ((a, v_{j+1}, ?var_i) \ \textsc{Opt} \ (a, false, ?V_{j+1})))$$
$$\quad \ldots$$
$$\quad \textsc{Opt} \ ((a, v_k, ?var_i) \ \textsc{Opt} \ (a, false, ?V_k)))$$

Let us next consider the $P_i$ and $Q_i$ used for simulating
the quantifier alternations. The original definition of these
expressions was given in the proof for fragment $\mathcal{AFO}$. With
a similar argumentation as before we can replace each occurrence
of operator And by Opt without changing
the semantics of the whole expression. This results in the
following $\mathcal{O}$ encodings for $P_i$ and $Q_i$, $i \in [m]$.

$$P_i := ((a, tv, ?X_1) \ \textsc{Opt} \ \ldots \ \textsc{Opt} \ (a, tv, ?X_i) \ \textsc{Opt} \ (a, tv, ?Y_1) \ \textsc{Opt} \ \ldots \ \textsc{Opt} \ (a, tv, ?Y_{i-1}) \ \textsc{Opt} \ (a, false, ?A_{i-1}) \ \textsc{Opt} \ (a, true, ?A_i)),$$

$$Q_i := ((a, tv, ?X_1) \ \textsc{Opt} \ \ldots \ \textsc{Opt} \ (a, tv, ?X_i) \ \textsc{Opt} \ (a, tv, ?Y_1) \ \textsc{Opt} \ \ldots \ \textsc{Opt} \ (a, tv, ?Y_i) \ \textsc{Opt} \ (a, false, ?B_{i-1}) \ \textsc{Opt} \ (a, true, ?B_i)).$$



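In the underlying proof for $\mathcal{AO}$, the conjunction $\psi$ is
encoded as $P_{C_1} \ \textsc{And} \ \ldots \ \textsc{And} \ P_{C_n}$, thus we have not yet
eliminated all And operators. We briefly summarize what
we have achieved so far:

$$P_\varphi := ((a, true, ?B_0) \ \textsc{Opt} \ (P_1 \ \textsc{Opt} \ (Q_1 \ \ldots \ \textsc{Opt} \ (P_{m-1} \ \textsc{Opt} \ (Q_{m-1} \ \textsc{Opt} \ (P'))) \ldots ))), \text{ where}$$

$$P' = P_m \ \textsc{Opt} \ (Q_m \ \textsc{And} \ (P_\psi)) = P_m \ \textsc{Opt} \ (Q_m \ \textsc{And} \ P_{C_1} \ \textsc{And} \ \ldots \ \textsc{And} \ P_{C_n})$$

Note that $P'$ is the only expression that still contains And
operators (where $Q_m, P_{C_1}, \ldots, P_{C_n}$ are already And-free).
We now exploit the rewriting given in Lemma 7. In particular,
we replace $P'$ in $P_\varphi$ by the expression $P^*$ defined as

$$P^* := Q' \ \textsc{Opt} \ ((\ldots((Q'' \ \textsc{Opt} \ \overline{V}_2) \ \textsc{Opt} \ \overline{V}_3) \ \textsc{Opt} \ \ldots) \ \textsc{Opt} \ \overline{V}_{n+1}),$$

where

$$Q' := ((\ldots((P_m \ \textsc{Opt} \ V_2) \ \textsc{Opt} \ V_3) \ldots) \ \textsc{Opt} \ V_{n+1}),$$
$$Q'' := ((\ldots((Q_m \ \textsc{Opt} \ (P_{C_1} \ \textsc{Opt} \ V_2)) \ \textsc{Opt} \ (P_{C_2} \ \textsc{Opt} \ V_3)) \ldots) \ \textsc{Opt} \ (P_{C_n} \ \textsc{Opt} \ V_{n+1})),$$
$$V_i := (a, true, ?V_i),$$
$$\overline{V}_i := (a, false, ?V_i),$$

and the $?V_i$ ($i \in \{2, \ldots, n + 1\}$) are fresh variables.

The resulting $P_\varphi$ is now an $\mathcal{O}$-expression. From Lemma 7
it follows that the result of evaluating $P^*$ is obtained from
the result of $P'$ by extending each mapping in $P'$ by additional
variables, more precisely by
$\{?V_2 \mapsto 1, ?V_3 \mapsto 1, \ldots, ?V_{n+1} \mapsto 1\}$, i.e. the results are identical modulo
this extension. It is straightforward to verify that these additional
bindings do not harm the construction, i.e. it holds
that $\{?B_0 \mapsto 1\} \in \llbracket P_\varphi \rrbracket_D$ iff $\varphi$ is valid. □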

$\Sigma^P_{n+1}$-completeness of Fragment $\mathcal{E}_{\le n}$ (Theorem 4)

We start with two lemmas that will be used in the proof.

Lemma 8. Let

$$D = \{(a, tv, 0), (a, tv, 1), (a, true, 1), (a, false, 0)\}$$

be an RDF database and $F = \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m \, \psi$
($m \ge 1$) be a QBF, where $\psi$ is a quantifier-free boolean
formula. There is an $\mathcal{E}_{\le 2m}$ encoding $enc(F)$ of $F$ s.t.

1. $F$ is valid exactly if $\{?B_0 \mapsto 1\} \in \llbracket enc(F) \rrbracket_D$

2. $F$ is invalid exactly if all mappings $\mu' \in \llbracket enc(F) \rrbracket_D$
are of the form $\mu' = \mu'_1 \cup \mu'_2$, where $\mu'_1 \sim \mu'_2$ and
$\mu'_1 = \{?B_0 \mapsto 1, ?A_1 \mapsto 1\}$. □

Proof: The lemma follows from the PSPACE-hardness
proof for fragment $\mathcal{AFO}$, where we have shown how to
encode QBF for a (possibly non-CNF) inner formula $\psi$. □

Lemma 9. Let $A$ and $B$ be SPARQL expressions for which
the evaluation problem is in $\Sigma^P_i$, $i \ge 1$, and let $R$ be a
Filter condition. The following claims hold.

1. The Evaluation problem for the SPARQL expression
$A \ \textsc{Union} \ B$ is in $\Sigma^P_i$.

2. The Evaluation problem for the SPARQL expression
$A \ \textsc{And} \ B$ is in $\Sigma^P_i$.

3. The Evaluation problem for the SPARQL expression
$A \ \textsc{Filter} \ R$ is in $\Sigma^P_i$. □

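Proof: 1. According to the SPARQL semantics we have
that $\mu \in \llbracket A \ \textsc{Union} \ B \rrbracket$ if and only if $\mu \in \llbracket A \rrbracket$ or $\mu \in \llbracket B \rrbracket$.
By assumption, both conditions can be checked individually
in $\Sigma^P_i$, and so can both checks in sequence.

2. It is easy to see that $\mu \in \llbracket A \ \textsc{And} \ B \rrbracket$ iff $\mu$ can
be decomposed into two compatible mappings $\mu_1$ and $\mu_2$
s.t. $\mu = \mu_1 \cup \mu_2$ and $\mu_1 \in \llbracket A \rrbracket$ and $\mu_2 \in \llbracket B \rrbracket$. By assumption,
testing $\mu_1 \in \llbracket A \rrbracket$ ($\mu_2 \in \llbracket B \rrbracket$) is in $\Sigma^P_i$. Since $i \ge 1$,
this complexity class is at least $\Sigma^P_1 = NP$. So we can guess
a decomposition $\mu = \mu_1 \cup \mu_2$ and test for the two conditions
one after the other. Hence, the whole procedure is in $\Sigma^P_i$.

3. $\mu \in \llbracket A \ \textsc{Filter} \ R \rrbracket$ holds iff $\mu \in \llbracket A \rrbracket$, which can be
tested in $\Sigma^P_i$ by assumption, and $\mu$ satisfies $R$, which can be
tested in polynomial time. Since $\Sigma^P_i \supseteq NP \supseteq P$ for $i \ge 1$,
the whole procedure is still in $\Sigma^P_i$. □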

We are now ready to prove Theorem 4. The proof divides
into two parts, i.e. hardness and membership. The hardness
proof is a reduction from QBF with a fixed number of
quantifier alternations. Second, we prove by induction on
the OPT-rank that there exists a $\Sigma^P_{n+1}$-algorithm to solve the
Evaluation problem for $\mathcal{E}_{\le n}$ expressions.

Hardness. We consider a QBF of the form

$$\varphi = \exists x_0 \forall x_1 \exists x_2 \ldots Q x_n \, \psi,$$

where $n \ge 1$, $Q = \exists$ if $n$ is even, $Q = \forall$ if $n$ is odd,
and $\psi$ is a quantifier-free boolean formula. It is known that
the Validity problem for such formulae is $\Sigma^P_{n+1}$-complete.
We now present a (polynomial-time) reduction from the Validity
problem for these quantified boolean formulae to
the Evaluation problem for the $\mathcal{E}_{\le n}$ fragment, to prove
$\Sigma^P_{n+1}$-hardness. We distinguish two cases.

Case 1: Let $Q = \exists$, so the formula is of the form

$$F = \exists y_0 \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m \, \psi.$$

The formula $F$ has $2m + 1$ quantifier alternations, so we
need to find an $\mathcal{E}_{\le 2m}$ encoding for this expression. We
rewrite $F$ into an equivalent formula $F = F_1 \lor F_2$, where

$$F_1 = \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m (\psi \land y_0), \text{ and}$$
$$F_2 = \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m (\psi \land \neg y_0).$$

According to Lemma 8 there is a fixed document $D$ and
$\mathcal{E}_{\le 2m}$ encodings $enc(F_1)$ and $enc(F_2)$ (for $F_1$ and $F_2$,
respectively) s.t. $\llbracket enc(F_1) \rrbracket_D$ ($\llbracket enc(F_2) \rrbracket_D$) contains the mapping
$\mu = \{?B_0 \mapsto 1\}$ if and only if $F_1$ ($F_2$) is valid. Then the
result of the expression $enc(F) = enc(F_1) \ \textsc{Union} \ enc(F_2)$ contains $\mu$
if and only if $F_1$ or $F_2$ is valid, i.e. iff $F$ is valid. Clearly,
$enc(F)$ is an $\mathcal{E}_{\le 2m}$ expression, so $enc(F)$ constitutes the
desired $\mathcal{E}_{\le 2m}$ encoding of the Evaluation problem.

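Case 2: Let $Q = \forall$, so the formula is of the form

$$F = \exists x_0 \forall y_0 \exists x_1 \forall y_1 \ldots \exists x_m \forall y_m \, \psi.$$

$F$ has $2m + 2$ quantifier alternations, so we need to provide
a reduction into the $\mathcal{E}_{\le 2m+1}$ fragment. We eliminate the
outer $\exists$-quantifier by rewriting $F$ as $F = F_1 \lor F_2$, where

$$F_1 = \forall y_0 \exists x_1 \forall y_1 \ldots \exists x_m \forall y_m (\psi \land x_0), \text{ and}$$
$$F_2 = \forall y_0 \exists x_1 \forall y_1 \ldots \exists x_m \forall y_m (\psi \land \neg x_0).$$

Abstracting from the details of the inner formula, both $F_1$
and $F_2$ are of the form

$$F' = \forall y_0 \exists x_1 \forall y_1 \ldots \exists x_m \forall y_m \, \psi',$$

where $\psi'$ is a quantifier-free boolean formula. We now
show (*) that we can encode $F'$ by an $\mathcal{E}_{\le 2m+1}$ expression
$enc(F')$ that, evaluated on a fixed document $D$, yields a fixed
mapping $\mu$ exactly if $F'$ is valid. This is sufficient, because
then $enc(F_1) \ \textsc{Union} \ enc(F_2)$ is an $\mathcal{E}_{\le 2m+1}$ expression
whose result contains $\mu$ exactly if the original formula $F = F_1 \lor F_2$
is valid. We again start by rewriting $F'$:

$$F' = \forall y_0 \exists x_1 \forall y_1 \ldots \exists x_m \forall y_m \, \psi'$$
$$= \neg \exists y_0 \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m \, \neg\psi'$$
$$= \neg (F'_1 \lor F'_2), \text{ where}$$

$$F'_1 = \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m (\neg\psi' \land y_0), \text{ and}$$
$$F'_2 = \forall x_1 \exists y_1 \ldots \forall x_m \exists y_m (\neg\psi' \land \neg y_0).$$

According to Lemma 8 each $F'_i$ can be encoded by an
$\mathcal{E}_{\le 2m}$ expression $enc(F'_i)$ s.t., on the fixed database $D$
given there, (1) $\mu = \{?B_0 \mapsto 1\} \in \llbracket enc(F'_i) \rrbracket_D$ iff $F'_i$ is valid
and (2) if $F'_i$ is not valid, then all mappings in $\llbracket enc(F'_i) \rrbracket_D$
bind both variables $?A_1$ and $?B_0$ to 1. Then the same conditions
(1) and (2) hold for $enc(F'_1) \ \textsc{Union} \ enc(F'_2)$ exactly
if $F'_1 \lor F'_2$ is valid. Now consider the expression
$Q = ((a, false, ?A_1) \ \textsc{Opt} \ (enc(F'_1) \ \textsc{Union} \ enc(F'_2)))$. The result of this
expression contains $\mu' = \{?A_1 \mapsto 0\}$ (when evaluated on the
database given in Lemma 8) if and only if $F'_1 \lor F'_2$ is not
valid (since otherwise, there is a compatible mapping for $\mu'$,
namely $\{?B_0 \mapsto 1\}$, in $\llbracket enc(F'_1) \ \textsc{Union} \ enc(F'_2) \rrbracket_D$). In
summary, this means $\mu' \in \llbracket Q \rrbracket_D$ if and only if $\neg(F'_1 \lor F'_2) = F'$
holds. Since both $enc(F'_1)$ and $enc(F'_2)$ are $\mathcal{E}_{\le 2m}$
expressions, $Q$ is contained in $\mathcal{E}_{\le 2m+1}$, so (*) holds.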

Membership. We next prove membership of $\mathcal{E}_{\le n}$ expressions
in $\Sigma^P_{n+1}$ by induction on the OPT-rank. Let us assume
that for each $\mathcal{E}_{\le n}$ expression ($n \in \mathbb{N}_0$) Evaluation is in
$\Sigma^P_{n+1}$. As stated in Theorem 1(2), Evaluation is $\Sigma^P_1 =$
NP-complete for OPT-free expressions, so the hypothesis
holds for the basic case. In the induction step we increase
the OPT-rank from $n$ to $n + 1$.

We distinguish several cases, depending on the structure
of the expression, say $A$, with OPT-rank $n + 1$.

Case 1: Checking if $\mu \in \llbracket A \ \textsc{Opt} \ B \rrbracket$. First note that
$A \ \textsc{Opt} \ B$ is in $\mathcal{E}_{\le n+1}$, and from the definition of OPT-rank
$r$ it follows immediately that both $A$ and $B$ are in $\mathcal{E}_{\le n}$.
Hence, by induction hypothesis, both $A$ and $B$ can be evaluated
in $\Sigma^P_{n+1}$. By semantics, we have that
$\llbracket A \ \textsc{Opt} \ B \rrbracket = \llbracket A \ \textsc{And} \ B \rrbracket \cup (\llbracket A \rrbracket \setminus \llbracket B \rrbracket)$, so $\mu$ is in $\llbracket A \ \textsc{Opt} \ B \rrbracket$ iff it is
generated by the (a) left or (b) right side of the union. Following
Lemma 9, part (a) can be checked in $\Sigma^P_{n+1}$. (b) The
more interesting part is to check if $\mu \in \llbracket A \rrbracket \setminus \llbracket B \rrbracket$. According
to the semantics of operator $\setminus$, this check can be
formulated as $C = C_1 \land C_2$, where $C_1 = \mu \in \llbracket A \rrbracket$ and
$C_2 = \neg \exists \mu' \in \llbracket B \rrbracket : \mu$ and $\mu'$ are compatible. By induction
hypothesis, $C_1$ can be checked in $\Sigma^P_{n+1}$. We argue that
also $\neg C_2 = \exists \mu' \in \llbracket B \rrbracket : \mu$ and $\mu'$ are compatible can be
evaluated in $\Sigma^P_{n+1}$: we can guess a mapping $\mu'$ (in NP), then
check if $\mu' \in \llbracket B \rrbracket$ (in $\Sigma^P_{n+1}$), and finally check if $\mu$ and $\mu'$ are
compatible (in polynomial time). Since $P \subseteq NP \subseteq \Sigma^P_{n+1}$,
all these checks in sequence can be done in $\Sigma^P_{n+1}$. Checking
if the inverse problem, i.e. $C_2$, holds is then possible in
$co\Sigma^P_{n+1} = \Pi^P_{n+1}$. Summarizing cases (a) and (b) we observe
that (a) $\Sigma^P_{n+1}$ and (b) $\Pi^P_{n+1}$ are contained in $\Sigma^P_{n+2}$,
so both checks in sequence can be executed in $\Sigma^P_{n+2}$, which
completes case 1.

Case 2: Checking if $\mu \in \llbracket A \ \textsc{And} \ B \rrbracket$. Figure 4(a) shows
the structure of a sample And expression, where the • symbols
represent non-Opt operators (i.e. And, Union, or Filter),
and $t$ stands for triple patterns. There is an arbitrary
number of Opt subexpressions (which might, of course, contain
Opt subexpressions themselves). Each of these subexpressions
has OPT-rank $\le n + 1$. Using the same argumentation
as in case 1, the evaluation problem for all of
them is in $\Sigma^P_{n+2}$. Moreover, there might be triple leaf nodes;
the evaluation problem for such patterns is in PTime, and
clearly $P \subseteq \Sigma^P_{n+2}$. Figure 4(b) illustrates the situation when
all Opt expressions and triple patterns have been replaced
by the complexity of their Evaluation problem.

We then proceed as follows. We apply Lemma 9 repeatedly,
folding the remaining And, Union, and Filter
subexpressions bottom up. The lemma guarantees that these
folding operations do not increase the complexity class, and
it is easy to prove that Evaluation remains in $\Sigma^P_{n+2}$ for
the whole expression.

Figure 4: (a) AND-expression with increased OPT-rank; (b) The OPT-expressions and leaf nodes
have been replaced by the complexity class of their Evaluation problem.

Case 3: Checking if $\mu \in \llbracket A \ \textsc{Union} \ B \rrbracket$ and Case 4:
Checking if $\mu \in \llbracket A \ \textsc{Filter} \ R \rrbracket$. Similar to case 2.

It is worth mentioning that the structural induction is polynomially
bounded by the size of the expression when the
nesting depth of Opt operators is fixed (which holds by assumption),
i.e. comprises only polynomially many steps. In
all cases except case 1 the recursive calls concern subexpressions
of the original expression. In case 1, $\mu \in \llbracket A \ \textsc{Opt} \ B \rrbracket$
generates two checks, namely $\mu \in \llbracket A \ \textsc{And} \ B \rrbracket$ and
$\mu \in \llbracket A \rrbracket \setminus \llbracket B \rrbracket$. These two checks might trigger recursive checks
again. But since the nesting depth of Opt expressions is
restricted, it is easy to see that there is a polynomial that
bounds the number of recursive calls. □

A.4 Queries: Fragments Including SELECT
Proof of Theorem 5

Let $F$ be a fragment for which the Evaluation problem is
C-complete, where C is a complexity class s.t. $NP \subseteq C$. We
show that, for a query $Q \in F^+$, document $D$, and mapping $\mu$,
testing if $\mu \in \llbracket Q \rrbracket_D$ is contained in C (C-hardness follows
trivially from C-hardness of fragment $F$). By definition,
each query in $F^+$ is of the form $Q = \textsc{Select}_S(Q')$, where
$S \subset V$ is a finite set of variables and $Q'$ is an $F$-expression.
By definition of operator Select, $\mu \in \llbracket Q \rrbracket_D$ holds if and
only if there exists a mapping $\mu' \in \llbracket Q' \rrbracket_D$ s.t. $\mu' = \mu \cup \mu''$
for some mapping $\mu'' \sim \mu$. We observe that the domain of
candidate mappings $\mu''$ is bounded by the set of variables
in $Q'$, i.e. $dom(\mu'') \subseteq vars(Q')$. Hence, we can guess a
mapping $\mu''$ (this is possible since we are at least in NP) and
subsequently check if $\mu' = \mu \cup \mu'' \in \llbracket Q' \rrbracket_D$, which is also
possible in C. The whole algorithm is in C. □

Proof of Theorem 6

First, we show that Evaluation for $\mathcal{A}^+$-queries is contained
in NP. By definition, each query in $\mathcal{A}^+$ is of the form
$Q = \textsc{Select}_S(Q')$, where $S \subset V$ is a finite set of variables
and $Q'$ is an And-only expression. Further let $D$ be a document
and $\mu$ a mapping. To prove membership, we follow
the approach taken in the previous proof (of Theorem 5) and
eliminate the Select clause. More precisely, we guess a
mapping $\mu'' \sim \mu$ and check if $\mu'' \cup \mu \in \llbracket Q' \rrbracket_D$ (we refer the
reader to the proof of Theorem 5 before for more details).
Again, the size of the mapping to be guessed is bounded, and
it is easy to see that the resulting algorithm is in NP.

To prove hardness we reduce 3Sat, a prototypical NP- 
complete problem, to our problem. The proof was inspired 
by the reduction of 3 Sat to the evaluation problem for con- 
junctive queries in lITTl . The 3Sat problem is defined as 
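follows. Let $\psi$ be a boolean formula

$$\psi = C_1 \land \cdots \land C_n$$

in CNF, where each clause $C_i$ is of the form

$$C_i = l_{i1} \lor l_{i2} \lor l_{i3},$$

i.e. contains exactly three, possibly negated, literals: is $\psi$
satisfiable? For our encoding we use the fixed database

$$D := \{(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, c, 1), (1, c, 0)\},$$

where we assume that $0, 1 \in I$ are any IRIs. Note that $D$ contains
every triple over $\{0, 1\}$ except $(0, 0, 0)$, so a clause pattern matches
exactly if at least one of its literals takes value 1. Further let
$V = \{x_1, \ldots, x_m\}$ denote the set of variables occurring in
formula $\psi$. We set up the SPARQL core expression

$$P' := (L^*_{11}, L^*_{12}, L^*_{13}) \ \textsc{And} \ \ldots \ \textsc{And} \ (L^*_{n1}, L^*_{n2}, L^*_{n3})$$
$$\quad \textsc{And} \ (?X_1, c, ?\overline{X}_1) \ \textsc{And} \ \ldots \ \textsc{And} \ (?X_m, c, ?\overline{X}_m)$$
$$\quad \textsc{And} \ (0, c, ?A), \text{ where}$$

$L^*_{ij} := \; ?X_k$ if $l_{ij} = x_k$, and $L^*_{ij} := \; ?\overline{X}_k$ if $l_{ij} = \neg x_k$.

Finally, set $P := \textsc{Select}_{?A}(P')$. It is straightforward to
verify that $\mu = \{?A \mapsto 1\} \in \llbracket P \rrbracket_D$ iff $\psi$ is satisfiable. □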
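The 3Sat encoding can be tried out on small formulas with the following Python sketch. It rests on several assumptions of our own: the database holds every triple over {0, 1} except (0, 0, 0), which is what the clause patterns need so that a clause with three false literals finds no match; literals are given as non-zero integers (k for $x_k$, -k for $\neg x_k$); the complement variable of `?Xk` is named `?NXk`; and the And-only expression is evaluated by materializing all joins, which is exponential in general.

```python
from functools import reduce
from itertools import product

D = ({t for t in product((0, 1), repeat=3) if t != (0, 0, 0)}
     | {(0, "c", 1), (1, "c", 0)})

def is_var(x):
    return isinstance(x, str) and x.startswith("?")

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == x for v, x in d1.items() if v in d2)

def join(A, B):
    return {m1 | m2 for m1 in A for m2 in B if compatible(m1, m2)}

def match(t):
    # [[t]]_D for a triple pattern t.
    result = set()
    for triple in D:
        m = {}
        if all(m.setdefault(p, v) == v if is_var(p) else p == v
               for p, v in zip(t, triple)):
            result.add(frozenset(m.items()))
    return result

def lit(l):
    return f"?X{l}" if l > 0 else f"?NX{-l}"

def satisfiable_via_sparql(clauses, m):
    # Complement patterns first, so that later joins stay small.
    patterns = [(f"?X{k}", "c", f"?NX{k}") for k in range(1, m + 1)]
    patterns += [tuple(lit(l) for l in clause) for clause in clauses]
    patterns.append((0, "c", "?A"))
    p_prime = reduce(join, (match(t) for t in patterns))
    # Select_{?A}: project every mapping onto ?A.
    projected = {frozenset(b for b in mu if b[0] == "?A") for mu in p_prime}
    return frozenset({("?A", 1)}) in projected
```

A clause pattern with variables in all three positions also matches the two triples with predicate c, but such matches bind a clause variable to c and are then eliminated by the complement patterns.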

B. Algebraic Results 



B.l Proofs of the Equivalences in Figure [Ul-IV) 
I. Idempotence and Inverse

The two equivalences (UIdem) and (Inv) follow directly
from the definition of operators $\cup$ and $\setminus$, respectively.

(JIdem). Let $A^-$ be an $\mathbb{A}^-$-expression. We show that both
directions of the equivalence hold. $\Rightarrow$: Consider a mapping
$\mu \in A^- \bowtie A^-$. Then $\mu = \mu_1 \cup \mu_2$ where $\mu_1, \mu_2 \in A^-$
and $\mu_1 \sim \mu_2$. Each pair of distinct mappings in $A^-$ is
incompatible, so $\mu_1 = \mu_2$ and, consequently, $\mu_1 \cup \mu_2 = \mu_1$.
By assumption, $\mu_1 \in A^-$ holds and we are done. $\Leftarrow$:
Consider a mapping $\mu \in A^-$. Choose $\mu$ for both the left and
right expression in $A^- \bowtie A^-$. By assumption, $\mu = \mu \cup \mu$ is
contained in the left side expression of the equation. □

(LIdem). Let $A^-$ be an $\mathbb{A}^-$-expression. Then

$$A^- ⟕ A^- = (A^- \bowtie A^-) \cup (A^- \setminus A^-) \quad [\text{semantics}]$$
$$= (A^- \bowtie A^-) \cup \emptyset \quad [(\text{Inv})]$$
$$= A^- \bowtie A^-$$
$$= A^- \quad [(\text{JIdem})],$$

which proves the equivalence. □

II. Associativity

(UAss) and (JAss) are trivial (cf. [26]).

III. Commutativity

(UComm) and (JComm) are trivial (cf. [26]).

IV. Distributivity 

(JUDistR). We show that both directions of the equivalence
hold. $\Rightarrow$: First assume that $\mu \in (A_1 \cup A_2) \bowtie A_3$. Then $\mu$
(according to the definition of $\bowtie$) is of the form $\mu_{12} \cup \mu_3$
where $\mu_{12} \in A_1 \cup A_2$, $\mu_3 \in A_3$, and $\mu_{12} \sim \mu_3$. More
precisely, $\mu_{12} \in A_1 \cup A_2$ means $\mu_{12}$ in $A_1$ or in $A_2$, so we
distinguish two cases. If (a) $\mu_{12} \in A_1$ then the subexpression
$A_1 \bowtie A_3$ on the right side generates $\mu = \mu_{12} \cup \mu_3$ (choose
$\mu_{12}$ from $A_1$ and $\mu_3$ from $A_3$); similarly, if (b) $\mu_{12} \in A_2$,
then the expression $A_2 \bowtie A_3$ on the right side generates $\mu$.
$\Leftarrow$: Consider a mapping $\mu \in (A_1 \bowtie A_3) \cup (A_2 \bowtie A_3)$.
Then $\mu$ is of the form (a) $\mu = \mu_1 \cup \mu_3$ or of the form (b)
$\mu = \mu_2 \cup \mu_3$ with $\mu_1 \in A_1$, $\mu_2 \in A_2$, $\mu_3 \in A_3$ (where
(a) $\mu_1 \sim \mu_3$ or (b) $\mu_2 \sim \mu_3$ holds, respectively). Case (a):
$\mu_1$ is contained in $A_1$, so it is also contained in $A_1 \cup A_2$.
Hence, on the left-hand side we choose $\mu_1$ from $A_1 \cup A_2$
and $\mu_3$ from $A_3$. By assumption they are compatible and
generate $\mu = \mu_1 \cup \mu_3$. Case (b) is symmetrical. □

(JUDistL). The equivalence follows from (JComm) and
(JUDistR) (cf. [26]). □

(MUDistR). We show that both directions of the equation
hold. $\Rightarrow$: Consider a mapping $\mu \in (A_1 \cup A_2) \setminus A_3$. Hence,
$\mu$ is contained in $A_1$ or in $A_2$ and there is no compatible
mapping in $A_3$. If $\mu \in A_1$ then the right side subexpression
$A_1 \setminus A_3$ generates $\mu$; in the other case $A_2 \setminus A_3$
does. $\Leftarrow$: Consider a mapping $\mu$ in $(A_1 \setminus A_3) \cup (A_2 \setminus A_3)$.
Then $\mu \in (A_1 \setminus A_3)$ or $\mu \in (A_2 \setminus A_3)$. In the first case,
$\mu$ is contained in $A_1$ and there is no compatible mapping
in $A_3$. Clearly, $\mu$ is then also contained in $A_1 \cup A_2$ and
$(A_1 \cup A_2) \setminus A_3$. The second case is symmetrical. □

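(LUDistR). The following rewriting proves the equivalence.

$(A_1 \cup A_2) ⟕ A_3$
$= ((A_1 \cup A_2) \bowtie A_3) \cup ((A_1 \cup A_2) \setminus A_3)$   [semantics]
$= ((A_1 \bowtie A_3) \cup (A_2 \bowtie A_3)) \cup ((A_1 \setminus A_3) \cup (A_2 \setminus A_3))$   [step (1)]
$= ((A_1 \bowtie A_3) \cup (A_1 \setminus A_3)) \cup ((A_2 \bowtie A_3) \cup (A_2 \setminus A_3))$   [step (2)]
$= (A_1 ⟕ A_3) \cup (A_2 ⟕ A_3)$   [semantics]

Step (1) is an application of (JUDistR) and (MUDistR);
in step (2) we applied (UAss) and (UComm). □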
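The three right-distributivity rules can also be exercised mechanically on randomly sampled mapping sets. The following is a sanity check under stated assumptions, not a substitute for the proofs: the universe of mappings (all partial assignments of `?x` and `?y` to 0 or 1), the sampling probability, and all function names are choices made only for this sketch.

```python
import random
from itertools import product

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == val for v, val in d1.items() if v in d2)

def join(a, b):
    return {m1 | m2 for m1, m2 in product(a, b) if compatible(m1, m2)}

def minus(a, b):
    return {m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)}

def left_join(a, b):
    return join(a, b) | minus(a, b)

# Assumed test universe: every partial mapping of ?x, ?y into {0, 1}.
UNIVERSE = [
    frozenset((v, c) for v, c in zip(("?x", "?y"), vals) if c is not None)
    for vals in product((None, 0, 1), repeat=2)
]

def right_distributivity_holds(trials=200, seed=0):
    """Check (JUDistR), (MUDistR), (LUDistR) on random mapping sets."""
    rng = random.Random(seed)

    def sample():
        return {m for m in UNIVERSE if rng.random() < 0.4}

    for _ in range(trials):
        a1, a2, a3 = sample(), sample(), sample()
        if join(a1 | a2, a3) != join(a1, a3) | join(a2, a3):
            return False
        if minus(a1 | a2, a3) != minus(a1, a3) | minus(a2, a3):
            return False
        if left_join(a1 | a2, a3) != left_join(a1, a3) | left_join(a2, a3):
            return False
    return True

print(right_distributivity_holds())
```

Because the rules are proven above, the check is expected to succeed for every sample; a failure would indicate a bug in the operator sketch rather than in the equivalences.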

B.2 Proofs of the Equivalences in Figure (V-VI)

In the paper satisfaction was defined informally; to be
self-contained, we repeat the formal definition from [26].

Definition 16. Given a mapping $\mu$, filter conditions $R$,
$R_1$, $R_2$, variables $?x$, $?y$, and constant $c$, we say that $\mu$
satisfies $R$ (denoted as $\mu \models R$), if

1. $R$ is $bnd(?x)$ and $?x \in dom(\mu)$,

2. $R$ is $?x = c$, $?x \in dom(\mu)$, and $\mu(?x) = c$,

3. $R$ is $?x = ?y$, $\{?x, ?y\} \subseteq dom(\mu)$, and $\mu(?x) = \mu(?y)$,

4. $R$ is $\neg R_1$ and it is not the case that $\mu \models R_1$,

5. $R$ is $R_1 \vee R_2$ and $\mu \models R_1$ or $\mu \models R_2$,

6. $R$ is $R_1 \wedge R_2$ and $\mu \models R_1$ and $\mu \models R_2$.

Recall that, given a set of mappings $\Omega$ and filter condition
$R$, $\sigma_R(\Omega)$ is defined as the subset of mappings in $\Omega$ that
satisfy condition $R$, i.e. $\sigma_R(\Omega) = \{\mu \in \Omega \mid \mu \models R\}$.
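Definition 16 translates almost directly into a recursive evaluator. In the sketch below, mappings are plain dictionaries and filter conditions are nested tuples; the tags `"bnd"`, `"eqc"`, `"eqv"`, `"not"`, `"or"`, `"and"` and the function names are an encoding assumed for illustration only.

```python
def satisfies(mu, r):
    """Return True iff mapping mu satisfies filter condition r (Definition 16)."""
    op = r[0]
    if op == "bnd":                       # case 1: bnd(?x)
        return r[1] in mu
    if op == "eqc":                       # case 2: ?x = c
        return r[1] in mu and mu[r[1]] == r[2]
    if op == "eqv":                       # case 3: ?x = ?y
        return r[1] in mu and r[2] in mu and mu[r[1]] == mu[r[2]]
    if op == "not":                       # case 4: negation
        return not satisfies(mu, r[1])
    if op == "or":                        # case 5: disjunction
        return satisfies(mu, r[1]) or satisfies(mu, r[2])
    if op == "and":                       # case 6: conjunction
        return satisfies(mu, r[1]) and satisfies(mu, r[2])
    raise ValueError(f"unknown filter condition: {op!r}")

def select(r, omega):
    """sigma_R(Omega): the mappings in omega that satisfy r."""
    return [mu for mu in omega if satisfies(mu, r)]

omega = [{"?x": 0}, {"?x": 1, "?y": 1}, {"?y": 1}]
print(select(("bnd", "?x"), omega))
print(select(("not", ("eqc", "?x", 0)), omega))
```

Note that, under this two-valued definition, a mapping in which `?x` is unbound does not satisfy `?x = 0` and therefore does satisfy its negation.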

The following proposition states that function $safeVars(A)$
returns only variables that are bound in each mapping when
evaluating $A$ on any document $D$. It will be required in some
of the subsequent proofs.

Proposition 5. Let $A$ be a SPARQL Algebra expression
and let $\Omega_A$ denote the mapping set obtained from evaluating
$A$ on any document. Then $?x \in safeVars(A) \Rightarrow
\forall \mu \in \Omega_A : ?x \in dom(\mu)$. □

Proof. The proof is by induction on the structure of $A$ and
application of Definition 5. We omit the details. □

Proofs of Equivalences in Figure (V-VI)

(SUPush). Follows directly from Proposition 1(5) in [26]. □

(SDecompI). Follows directly from Lemma 1(1) in [26]. □

(SDecompII). Follows directly from Lemma 1(2) in [26]. □

(SReord). Follows from the application of (SDecompI) and
the commutativity of the boolean operator $\wedge$. □

(BndI), (BndII), (BndIII), and (BndIV) are trivial.

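(BndV). Recall that by assumption $?x \notin vars(A_1)$. The
following rewriting proves the equivalence.

$\sigma_{bnd(?x)}(A_1 ⟕ A_2)$
$= \sigma_{bnd(?x)}((A_1 \bowtie A_2) \cup (A_1 \setminus A_2))$   [semantics]
$= \sigma_{bnd(?x)}(A_1 \bowtie A_2) \cup \sigma_{bnd(?x)}(A_1 \setminus A_2)$   [(SUPush)]
$= \sigma_{bnd(?x)}(A_1 \bowtie A_2)$   [$*_1$]
$= A_1 \bowtie A_2$   [$*_2$]

$*_1$ follows immediately from assumption $?x \notin vars(A_1)$;
$*_2$ follows from the observation that $?x \in safeVars(A_2)$. □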

(SJPush). $\Rightarrow$: Let $\mu \in \sigma_R(A_1 \bowtie A_2)$. By semantics,
$\mu \models R$. Furthermore, $\mu$ is of the form $\mu_1 \cup \mu_2$, where
$\mu_1 \in A_1$, $\mu_2 \in A_2$, and $\mu_1 \sim \mu_2$. Recall that by
assumption $vars(R) \subseteq safeVars(A_1)$, so we always have that
$dom(\mu_1) \supseteq vars(R)$ (cf. Proposition 5), i.e. each variable
that occurs in $R$ is bound in mapping $\mu_1$. It is easy to verify
that $\mu \models R$ implies $\mu_1 \models R$, since both mappings coincide
in the variables that are relevant for evaluating $R$.
Consequently $\sigma_R(A_1)$ on the right side generates $\mu_1$, and clearly
$\sigma_R(A_1) \bowtie A_2$ generates $\mu_1 \cup \mu_2 = \mu$. $\Leftarrow$: Consider a
mapping $\mu \in \sigma_R(A_1) \bowtie A_2$. Then $\mu$ is of the form $\mu = \mu_1 \cup \mu_2$,
$\mu_1 \in A_1$, $\mu_2 \in A_2$, $\mu_1 \sim \mu_2$, and $\mu_1 \models R$. It is easy to see
that then also $\mu_1 \cup \mu_2 \models R$, because $dom(\mu_1) \supseteq vars(R)$ (as
argued in case $\Rightarrow$) and $\mu_1 \cup \mu_2$ coincides with $\mu_1$ on all
variables that are relevant for evaluating $R$. Hence, $\mu = \mu_1 \cup \mu_2$
is generated by the left side of the equation. □

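(SMPush). $\Rightarrow$: Let $\mu \in \sigma_R(A_1 \setminus A_2)$. By semantics,
$\mu \in A_1$, there is no $\mu_2 \in A_2$ compatible with $\mu$, and
$\mu \models R$. From these preconditions it follows immediately
that $\mu \in \sigma_R(A_1) \setminus A_2$. $\Leftarrow$: Let $\mu \in \sigma_R(A_1) \setminus A_2$. Then
$\mu \in A_1$, $\mu \models R$, and there is no compatible mapping in $A_2$.
Clearly, then also $\mu \in A_1 \setminus A_2$ and $\mu \in \sigma_R(A_1 \setminus A_2)$. □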

(SLPush). The following rewriting proves the equivalence.

$\sigma_R(A_1 ⟕ A_2)$
$= \sigma_R((A_1 \bowtie A_2) \cup (A_1 \setminus A_2))$   [semantics]
$= \sigma_R(A_1 \bowtie A_2) \cup \sigma_R(A_1 \setminus A_2)$   [(SUPush)]
$= (\sigma_R(A_1) \bowtie A_2) \cup (\sigma_R(A_1) \setminus A_2)$   [$*$]
$= \sigma_R(A_1) ⟕ A_2$   [semantics]

$*$ denotes application of (SJPush) and (SMPush). □
B.3 Proofs of the Remaining Technical Results 



Proof of Proposition 1

Proof. Let $A^-$ be an $\mathbb{A}^-$-expression. The proof is by
induction on the structure of $A^-$. The basic case is $A^- = \llbracket t \rrbracket_D$.
By semantics, all mappings in $A^-$ bind exactly the
same set of variables, and consequently the values of each
two distinct mappings must differ in at least one variable,
which makes them incompatible. (Case 1) We assume
that the hypothesis holds and consider an expression $A^- =
A_1^- \bowtie A_2^-$. Then each mapping $\mu \in A^-$ is of the form
$\mu = \mu_1 \cup \mu_2$ with $\mu_1 \in A_1^-$, $\mu_2 \in A_2^-$, and $\mu_1 \sim \mu_2$.
We fix $\mu$ and show that each mapping $\mu' \in A^-$ different
from $\mu$ is incompatible. Any mapping $\mu' \in A^-$ that is
different from $\mu$ is of the form $\mu_1' \cup \mu_2'$ with $\mu_1' \in A_1^-$,
$\mu_2' \in A_2^-$ and $\mu_1'$ different from $\mu_1$ or $\mu_2'$ different from
$\mu_2$. Let us w.l.o.g. assume that $\mu_1'$ is different from $\mu_1$.
By induction hypothesis, $\mu_1$ is incompatible with $\mu_1'$. It is
easy to verify that then $\mu = \mu_1 \cup \mu_2$ is incompatible with
$\mu' = \mu_1' \cup \mu_2'$, since $\mu_1$ and $\mu_1'$ disagree in the value of
at least one variable. (Case 2) Let $A^- = A_1^- \setminus A_2^-$. By
induction hypothesis, each two mappings in $A_1^-$ are pairwise
incompatible. By semantics, $A^-$ is a subset of $A_1^-$, so the
incompatibility property still holds for $A^-$. (Case 3) Let
$A^- = A_1^- ⟕ A_2^-$. We rewrite the left outer join according
to its semantics: $A^- = A_1^- ⟕ A_2^- = (A_1^- \bowtie A_2^-) \cup (A_1^- \setminus A_2^-)$.
As argued in cases (1) and (2), the incompatibility
property holds for both subexpressions $A^{\bowtie} = A_1^- \bowtie A_2^-$
and $A^{\setminus} = A_1^- \setminus A_2^-$, so it suffices to show that the mappings
in $A^{\bowtie}$ are pairwise incompatible to those in $A^{\setminus}$. We observe
that $A^{\setminus}$ is a subset of $A_1^-$. Further, each mapping $\mu \in A^{\bowtie}$
is of the form $\mu = \mu_1 \cup \mu_2$, where $\mu_1 \in A_1^-$, $\mu_2 \in A_2^-$, and
$\mu_1 \sim \mu_2$. By assumption, each mapping in $A_1^-$, and hence
each mapping $\mu_1' \in A^{\setminus}$, is either identical to or incompatible
with $\mu_1$. (a) If $\mu_1 \neq \mu_1'$, then $\mu_1'$ is incompatible with $\mu_1$, and
consequently incompatible with $\mu_1 \cup \mu_2 = \mu$, so we are done.
(b) Let $\mu_1 = \mu_1'$. We observe that, by assumption, there is
a compatible mapping (namely $\mu_2$ in $A_2^-$). This means that
$A_1^- \setminus A_2^-$ does not generate $\mu_1'$, so we have a contradiction
(i.e., the assumption $\mu_1 = \mu_1'$ was invalid). (Case 4) Let
$A^- = \sigma_C(A_1^-)$. Analogously to case 2, $A^-$ is a subset of $A_1^-$,
for which the property holds by induction hypothesis. □

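Proof of Lemma 1

We provide an exhaustive set of counterexamples.

Proof of Claim 1. In this part, we give counterexamples
for two fragments: (a) $\mathbb{A}^{\{\bowtie, \setminus, ⟕, \sigma, \cup\}}$ and (b) $\mathbb{A}^{\{\bowtie, \setminus, ⟕, \sigma, \pi\}}$.
The result for the full algebra (i.e., $\mathbb{A}^{\{\bowtie, \setminus, ⟕, \sigma, \cup, \pi\}}$) follows.

(1a) Fragment $\mathbb{A}^{\{\bowtie, \setminus, ⟕, \sigma, \cup\}}$. We use the fixed database
$D = \{(0, c, 1)\}$. Consider the algebra expression $A =
\llbracket (?x, c, 1) \rrbracket_D \cup \llbracket (0, c, ?y) \rrbracket_D$. It is easy to see that both
$A \bowtie A = \{\{?x \mapsto 0\}, \{?y \mapsto 1\}, \{?x \mapsto 0, ?y \mapsto 1\}\}$
and $A ⟕ A = \{\{?x \mapsto 0\}, \{?y \mapsto 1\}, \{?x \mapsto 0, ?y \mapsto 1\}\}$
differ from $A = \{\{?x \mapsto 0\}, \{?y \mapsto 1\}\}$, which shows that
neither (JIdem) nor (LIdem) holds for this fragment.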
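Counterexample (1a) is small enough to be recomputed by hand or by a short script. The sketch below evaluates the two triple patterns over $D = \{(0, c, 1)\}$ and forms $A \bowtie A$ and $A ⟕ A$. The helper `eval_pattern` and the convention that variables are strings starting with `?` are assumptions of this illustration.

```python
from itertools import product

def eval_pattern(pattern, db):
    """Evaluate one triple pattern over a set of triples (set semantics)."""
    result = set()
    for triple in db:
        binding, ok = {}, True
        for p, t in zip(pattern, triple):
            if isinstance(p, str) and p.startswith("?"):
                if binding.setdefault(p, t) != t:   # repeated variable must agree
                    ok = False
                    break
            elif p != t:                            # constant must match exactly
                ok = False
                break
        if ok:
            result.add(frozenset(binding.items()))
    return result

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == val for v, val in d1.items() if v in d2)

def join(a, b):
    return {m1 | m2 for m1, m2 in product(a, b) if compatible(m1, m2)}

def minus(a, b):
    return {m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)}

def left_join(a, b):
    return join(a, b) | minus(a, b)

D = {(0, "c", 1)}
A = eval_pattern(("?x", "c", 1), D) | eval_pattern((0, "c", "?y"), D)

print(len(A), len(join(A, A)), len(left_join(A, A)))
```

Here `A` contains the two mappings {?x ↦ 0} and {?y ↦ 1}, which are compatible with each other, so both `join(A, A)` and `left_join(A, A)` additionally contain {?x ↦ 0, ?y ↦ 1}.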

(1b) Fragment $\mathbb{A}^{\{\bowtie, \setminus, ⟕, \sigma, \pi\}}$. We use the fixed database
$D = \{(0, f, 0), (1, t, 1), (a, tv, 0), (a, tv, 1)\}$. Consider the
algebra expression

$A = \pi_{\{?x, ?y\}}((t_1 ⟕ t_2) ⟕ t_3)$, where

$t_1 := \llbracket (a, tv, ?z) \rrbracket_D$,

$t_2 := \llbracket (?z, f, ?x) \rrbracket_D$, and

$t_3 := \llbracket (?z, t, ?y) \rrbracket_D$.

It is easy to verify that $A = \{\{?x \mapsto 0\}, \{?y \mapsto 1\}\}$
(recall that we assume set semantics).
For $A \bowtie A$ and $A ⟕ A$ we then get exactly the same results
as in part (1a), and we conclude that neither (JIdem) nor
(LIdem) holds for the fragment under consideration.

Proof of Claim 2. Trivial. 

Proof of Claims 3 + 4. We provide counterexamples for
each possible operator constellation. All counterexamples
are designed for the database $D = \{(0, c, 1)\}$.

Distributivity over $\cup$ (Claim 3 in Lemma 1):

• $A_1 \setminus (A_2 \cup A_3) \equiv (A_1 \setminus A_2) \cup (A_1 \setminus A_3)$ does not
hold, e.g. $A_1 = \llbracket (0, c, ?a) \rrbracket_D$, $A_2 = \llbracket (?a, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.

• $A_1 ⟕ (A_2 \cup A_3) \equiv (A_1 ⟕ A_2) \cup (A_1 ⟕ A_3)$ does
not hold, e.g. $A_1 = \llbracket (0, c, ?a) \rrbracket_D$, $A_2 = \llbracket (?a, c, 1) \rrbracket_D$,
and $A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.

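Distributivity over $\bowtie$ (Claim 4 in Lemma 1):

• $A_1 \cup (A_2 \bowtie A_3) \equiv (A_1 \cup A_2) \bowtie (A_1 \cup A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.

• $(A_1 \bowtie A_2) \cup A_3 \equiv (A_1 \cup A_3) \bowtie (A_2 \cup A_3)$ does not
hold (symmetrical to the previous one).

• $A_1 \setminus (A_2 \bowtie A_3) \equiv (A_1 \setminus A_2) \bowtie (A_1 \setminus A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.

• $(A_1 \bowtie A_2) \setminus A_3 \equiv (A_1 \setminus A_3) \bowtie (A_2 \setminus A_3)$ does not
hold, e.g. $A_1 = \llbracket (0, c, ?a) \rrbracket_D$, $A_2 = \llbracket (0, c, ?b) \rrbracket_D$, and
$A_3 = \llbracket (?a, c, 1) \rrbracket_D$ violates the equation.

• $A_1 ⟕ (A_2 \bowtie A_3) \equiv (A_1 ⟕ A_2) \bowtie (A_1 ⟕ A_3)$ does
not hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$,
and $A_3 = \llbracket (0, c, ?a) \rrbracket_D$ violates the equation.

• $(A_1 \bowtie A_2) ⟕ A_3 \equiv (A_1 ⟕ A_3) \bowtie (A_2 ⟕ A_3)$ does
not hold, e.g. $A_1 = \llbracket (0, c, ?a) \rrbracket_D$, $A_2 = \llbracket (0, c, ?b) \rrbracket_D$,
and $A_3 = \llbracket (?a, c, 1) \rrbracket_D$ violates the equation.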

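Distributivity over $\setminus$ (Claim 4 in Lemma 1):

• $A_1 \cup (A_2 \setminus A_3) \equiv (A_1 \cup A_2) \setminus (A_1 \cup A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (0, c, ?a) \rrbracket_D$, and
$A_3 = \llbracket (?a, c, 1) \rrbracket_D$ violates the equation.

• $(A_1 \setminus A_2) \cup A_3 \equiv (A_1 \cup A_3) \setminus (A_2 \cup A_3)$ does not
hold (symmetrical to the previous one).

• $A_1 \bowtie (A_2 \setminus A_3) \equiv (A_1 \bowtie A_2) \setminus (A_1 \bowtie A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?a) \rrbracket_D$ violates the equation.

• $(A_1 \setminus A_2) \bowtie A_3 \equiv (A_1 \bowtie A_3) \setminus (A_2 \bowtie A_3)$ does not
hold (symmetrical to the previous one).

• $A_1 ⟕ (A_2 \setminus A_3) \equiv (A_1 ⟕ A_2) \setminus (A_1 ⟕ A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (?b, c, 1) \rrbracket_D$ violates the equation.

• $(A_1 \setminus A_2) ⟕ A_3 \equiv (A_1 ⟕ A_3) \setminus (A_2 ⟕ A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.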

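Distributivity over ⟕ (Claim 4 in Lemma 1):

• $A_1 \cup (A_2 ⟕ A_3) \equiv (A_1 \cup A_2) ⟕ (A_1 \cup A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (c, c, c) \rrbracket_D$, and
$A_3 = \llbracket (?b, c, 1) \rrbracket_D$ violates the equation.

• $(A_1 ⟕ A_2) \cup A_3 \equiv (A_1 \cup A_3) ⟕ (A_2 \cup A_3)$ does not
hold (symmetrical to the previous one).

• $A_1 \bowtie (A_2 ⟕ A_3) \equiv (A_1 \bowtie A_2) ⟕ (A_1 \bowtie A_3)$ does
not hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$,
and $A_3 = \llbracket (0, c, ?a) \rrbracket_D$ violates the equation.

• $(A_1 ⟕ A_2) \bowtie A_3 \equiv (A_1 \bowtie A_3) ⟕ (A_2 \bowtie A_3)$ does
not hold (symmetrical to the previous one).

• $A_1 \setminus (A_2 ⟕ A_3) \equiv (A_1 \setminus A_2) ⟕ (A_1 \setminus A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?a) \rrbracket_D$ violates the equation.

• $(A_1 ⟕ A_2) \setminus A_3 \equiv (A_1 \setminus A_3) ⟕ (A_2 \setminus A_3)$ does not
hold, e.g. $A_1 = \llbracket (?a, c, 1) \rrbracket_D$, $A_2 = \llbracket (?b, c, 1) \rrbracket_D$, and
$A_3 = \llbracket (0, c, ?b) \rrbracket_D$ violates the equation.

The list of counterexamples is exhaustive. □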
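Each counterexample in the lists above can be recomputed over $D = \{(0, c, 1)\}$. The sketch below does so for one representative from three of the groups; the helper names and the encoding of variables as strings starting with `?` are assumptions of this illustration, and the three chosen cases are just samples.

```python
from itertools import product

def eval_pattern(pattern, db):
    """Evaluate one triple pattern over a set of triples (set semantics)."""
    result = set()
    for triple in db:
        binding, ok = {}, True
        for p, t in zip(pattern, triple):
            if isinstance(p, str) and p.startswith("?"):
                if binding.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            result.add(frozenset(binding.items()))
    return result

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == val for v, val in d1.items() if v in d2)

def join(a, b):
    return {m1 | m2 for m1, m2 in product(a, b) if compatible(m1, m2)}

def minus(a, b):
    return {m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)}

def left_join(a, b):
    return join(a, b) | minus(a, b)

D = {(0, "c", 1)}
ev = lambda pattern: eval_pattern(pattern, D)

# A1 \ (A2 u A3)  versus  (A1 \ A2) u (A1 \ A3)
a1, a2, a3 = ev((0, "c", "?a")), ev(("?a", "c", 1)), ev((0, "c", "?b"))
minus_over_union = (minus(a1, a2 | a3), minus(a1, a2) | minus(a1, a3))

# (A1 join A2) \ A3  versus  (A1 \ A3) join (A2 \ A3)
b1, b2, b3 = ev((0, "c", "?a")), ev((0, "c", "?b")), ev(("?a", "c", 1))
minus_over_join = (minus(join(b1, b2), b3), join(minus(b1, b3), minus(b2, b3)))

# A1 u (A2 leftjoin A3)  versus  (A1 u A2) leftjoin (A1 u A3)
c1, c2, c3 = ev(("?a", "c", 1)), ev(("c", "c", "c")), ev(("?b", "c", 1))
union_over_left_join = (c1 | left_join(c2, c3), left_join(c1 | c2, c1 | c3))

for lhs, rhs in (minus_over_union, minus_over_join, union_over_left_join):
    print(lhs != rhs)
```

In all three cases the two sides evaluate to different mapping sets, which is what the corresponding bullet claims.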
Proof of Proposition 2

(MReord). We consider all possible mappings $\mu$. Clearly,
if $\mu$ is not contained in $A_1$, it will be contained neither in
the right side nor in the left side of the expressions (both are
subsets of $A_1$). So we can restrict our discussion to mappings
$\mu \in A_1$. We distinguish three cases. Case (1): consider a
mapping $\mu \in A_1$ and assume there is a compatible mapping
in $A_2$. Then $\mu$ is not contained in $A_1 \setminus A_2$, and also not in
$(A_1 \setminus A_2) \setminus A_3$, which by definition is a subset of the former.
Now consider the right-hand side of the equation and let us
assume that $\mu \in A_1 \setminus A_3$ (otherwise we are done). Then,
as there is a compatible mapping to $\mu$ in $A_2$, the expression
$(A_1 \setminus A_3) \setminus A_2$ will not contain $\mu$. Case (2): The case
of $\mu \in A_1$ being compatible with any mapping from $A_3$ is
symmetrical to (1). Case (3): Let $\mu \in A_1$ be a mapping that
is not compatible with any mapping in $A_2$ and $A_3$. Then
both $(A_1 \setminus A_2) \setminus A_3$ on the left side and $(A_1 \setminus A_3) \setminus A_2$ on
the right side contain $\mu$. In all cases, $\mu$ is contained in the
right side exactly if it is contained in the left side. □

(MMUCorr). We show both directions of the equivalence.
$\Rightarrow$: Let $\mu \in (A_1 \setminus A_2) \setminus A_3$. Then $\mu \in A_1$ and there is neither
a compatible mapping $\mu_2 \in A_2$ nor a compatible mapping
$\mu_3 \in A_3$. Then both $A_2$ and $A_3$ contain only incompatible
mappings, and clearly $A_2 \cup A_3$ contains only incompatible
mappings. Hence, the right side $A_1 \setminus (A_2 \cup A_3)$ produces
$\mu$. $\Leftarrow$: Let $\mu \in A_1 \setminus (A_2 \cup A_3)$. Then $\mu \in A_1$ and there is
no compatible mapping in $A_2 \cup A_3$, which means that there
is neither a compatible mapping in $A_2$ nor in $A_3$. It follows
that $A_1 \setminus A_2$ contains $\mu$ (as there is no compatible mapping
in $A_2$ and $\mu \in A_1$). From the fact that there is no compatible
mapping in $A_3$, we deduce $\mu \in (A_1 \setminus A_2) \setminus A_3$. □

(MJ). See Lemma 3(2) in [26].

(LJ). Let $A_1^-$, $A_2^-$ be $\mathbb{A}^-$-expressions. The following
sequence of rewriting steps proves the equivalence.

$A_1^- ⟕ A_2^-$
$= (A_1^- \bowtie A_2^-) \cup (A_1^- \setminus A_2^-)$   [sem.]
$= (A_1^- \bowtie (A_1^- \bowtie A_2^-)) \cup (A_1^- \setminus (A_1^- \bowtie A_2^-))$   [$*$]
$= A_1^- ⟕ (A_1^- \bowtie A_2^-)$   [sem.]

$*$ denotes application of (JIdem), (JAss), and (MJ). □
Proof of Lemma 2

Let $A_1^-$, $A_2^-$ be $\mathbb{A}^-$-expressions, $R$ a filter condition, and
$?x \in safeVars(A_2^-) \setminus vars(A_1^-)$ a variable that is contained in
the set of safe variables of $A_2^-$, but not in $A_1^-$. We transform the
left side expression into the right side expression as follows.

$\sigma_{\neg bnd(?x)}(A_1^- ⟕ A_2^-)$
$= \sigma_{\neg bnd(?x)}((A_1^- \bowtie A_2^-) \cup (A_1^- \setminus A_2^-))$   [semantics]
$= \sigma_{\neg bnd(?x)}(A_1^- \bowtie A_2^-) \cup \sigma_{\neg bnd(?x)}(A_1^- \setminus A_2^-)$   [(SUPush)]
$= \sigma_{\neg bnd(?x)}(A_1^- \setminus A_2^-)$   [$*_1$]
$= A_1^- \setminus A_2^-$   [$*_2$]

We first show that rewriting step $*_1$ holds. Observe that
$?x \in safeVars(A_2^-)$, which implies (by Proposition 5) that
variable $?x$ is bound in each mapping generated by $A_2^-$.
Consequently, $?x$ is also bound in each mapping generated
by $A_1^- \bowtie A_2^-$ and the condition $\neg bnd(?x)$ is never satisfied
for the join part, so it can be eliminated. Concerning step
$*_2$ we observe that $?x \in safeVars(A_2^-) \setminus vars(A_1^-)$ implies
$?x \notin vars(A_1^-)$. It follows immediately that $?x$ is unbound
in any mapping generated by $A_1^- \setminus A_2^-$, so the surrounding
filter condition always holds and can be dropped. □
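The effect of the two bound-variable filters on a left outer join can be seen on a small example. In this sketch the mapping sets `A1` and `A2` are hypothetical inputs chosen so that `?x` is bound in every mapping of `A2` and does not occur in `A1`; the helper names are likewise assumptions.

```python
from itertools import product

def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[v] == val for v, val in d1.items() if v in d2)

def join(a, b):
    return {m1 | m2 for m1, m2 in product(a, b) if compatible(m1, m2)}

def minus(a, b):
    return {m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)}

def left_join(a, b):
    return join(a, b) | minus(a, b)

def filter_bound(var, omega):
    """sigma_{bnd(var)}: keep mappings in which var is bound."""
    return {m for m in omega if var in dict(m)}

def filter_unbound(var, omega):
    """sigma_{not bnd(var)}: keep mappings in which var is unbound."""
    return {m for m in omega if var not in dict(m)}

# Hypothetical inputs: ?x is bound in all of A2 and absent from A1.
A1 = {frozenset({("?a", 1)}), frozenset({("?a", 2)})}
A2 = {frozenset({("?a", 1), ("?x", 5)})}

opt = left_join(A1, A2)
print(filter_unbound("?x", opt) == minus(A1, A2))  # rewriting of Lemma 2
print(filter_bound("?x", opt) == join(A1, A2))     # rewriting of (BndV)
```

Both comparisons hold for this input: the mapping {?a ↦ 2} survives only through the minus part, and {?a ↦ 1, ?x ↦ 5} only through the join part.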

C. Proofs of the SQO Results 

Proof of Lemma 3

• Let $Q' \in C_1^{-1}(cb_\Sigma(C_1(Q))) \cap A^+$. Then $C_1(Q') \in
cb_\Sigma(C_1(Q))$. This implies $C_1(Q') \equiv_\Sigma C_1(Q)$. It follows
that $Q' \equiv_\Sigma Q$.

• Follows directly from the definition of the second
translation scheme.

• Follows from the last two points. □

Proof of Lemma 4

• Let 0' =E Q. We have that Cf i(t7(Ci(Q'))), 

Cf i([/(Ci(g))) e therefore Ci(g') =s Cx(ff). 



Then, it follows that $C_1(Q') \in cb_\Sigma(C_1(Q))$ and $Q' \in
C_1^{-1}(cb_\Sigma(C_1(Q)))$.

• Follows from the previous point and the second bullet of Lemma 3. □



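Proof of Lemma 5

• We transform $Q$ systematically. Let $D$ be an RDF
database such that $D \models \Sigma$.

$\llbracket Q \rrbracket_D$
$= \llbracket (Q_1 \text{ Opt } Q_2) \rrbracket_D$
$= \llbracket (Q_1 \text{ And } Q_2) \rrbracket_D \cup (\llbracket Q_1 \rrbracket_D \setminus \llbracket Q_2 \rrbracket_D)$
$= \llbracket (Q_1 \text{ And } Q_2) \rrbracket_D \cup (\pi_{vars(Q_1)} \llbracket (Q_1 \text{ And } Q_2) \rrbracket_D \setminus \llbracket Q_2 \rrbracket_D)$

It is easy to verify that each mapping in $\llbracket Q_1 \text{ And } Q_2 \rrbracket_D$
is compatible with at least one mapping in $\llbracket Q_2 \rrbracket_D$, and the
same still holds for the projection $\pi_{vars(Q_1)} \llbracket (Q_1 \text{ And } Q_2) \rrbracket_D$.
Hence, the right side of the union can be dropped and the
elimination simplifies to $Q \equiv_\Sigma (Q_1 \text{ And } Q_2)$.

• Let $D$ be an RDF database such that $D \models \Sigma$. Then we
have that

$\llbracket Q \rrbracket_D$
$= \llbracket (Q_1 \text{ Opt } (Q_2 \text{ And } Q_3)) \rrbracket_D$
$= \llbracket (Q_1 \text{ And } Q_2 \text{ And } Q_3) \rrbracket_D \cup (\llbracket Q_1 \rrbracket_D \setminus \llbracket Q_2 \text{ And } Q_3 \rrbracket_D)$
$= \llbracket (Q_1 \text{ And } Q_3) \rrbracket_D \cup (\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_2 \text{ And } Q_3 \rrbracket_D)$.

We now show that $(\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_2 \text{ And } Q_3 \rrbracket_D) =
(\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_3 \rrbracket_D)$. $\Rightarrow$: Assume that there is some
$\mu \in (\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_2 \text{ And } Q_3 \rrbracket_D)$. Then, for all
$\mu' \in \llbracket Q_2 \text{ And } Q_3 \rrbracket_D$ it holds that $\mu'$ is incompatible
with $\mu$. As $\mu$ is, by choice, compatible with some element in
$\llbracket Q_2 \rrbracket_D$, it must be incompatible with all elements in $\llbracket Q_3 \rrbracket_D$.
This implies $\mu \in (\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_3 \rrbracket_D)$. $\Leftarrow$: Assume
we have $\nu \in (\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus \llbracket Q_3 \rrbracket_D)$. Choose $\nu' \in
\llbracket Q_2 \text{ And } Q_3 \rrbracket_D$. It follows that the projection of $\nu'$ to
the variables in $Q_3$ is not compatible with $\nu$, therefore $\nu'$ is
not compatible with $\nu$. This implies $\nu \in (\llbracket Q_1 \text{ And } Q_2 \rrbracket_D \setminus
\llbracket Q_2 \text{ And } Q_3 \rrbracket_D)$. Consequently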

Md 

= TTsiUQi And Qg)!/? U ([Qi And Q2ID \ IQsJd)) 
= TTsiiiQi And Qg)!/? U (IQiJz, \ IQsId)) 
= [Oi Opt QsId- □ 



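Proof of Lemma 6

• Let $\mu \in \llbracket Q_1 \text{ Opt } Q_2 \rrbracket_D$. Then, $\mu(?x)$ is defined
because of $Q_1 \equiv_\Sigma \text{Select}_{vars(Q_1)}(Q_1 \text{ And } Q_2)$. So,
$\llbracket \text{Filter}_{\neg bnd(?x)}(Q_1 \text{ Opt } Q_2) \rrbracket_D = \emptyset$.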

• The proof of this claim is straightforward. 

• Assume that there is some /i G |FiLTER^72:=?a((32)]_D- 
So,/i|s G |Select5((52)1d- ItholdsthatSELECT5((52) 
=E Selects (Q2 If) =s Selects (FiLTER7a;=7j^(Q2))- 

It follows that ^ G |SELECTs(FlLTER7a;=7j,((52))lD, 

which is a contradiction. □ 
D. Proofs of the Chase Termination Results

D.1 Additional Definitions

Databases. We choose three pairwise disjoint infinite
sets $\Delta$, $\Delta_{null}$ and $V$. We will refer to $\Delta$ as the set of
constants, to $\Delta_{null}$ as the set of labelled nulls and to $V$ as the
set of variables. A database schema $\mathcal{R}$ is a finite set of
relational symbols $\{R_1, \ldots, R_n\}$. To every $R_i \in \mathcal{R}$ we assign
a natural number $ar(R_i) \in \mathbb{N}$, which we call the arity of
$R_i$. The arity of $\mathcal{R}$, denoted by $ar(\mathcal{R})$, is defined as
$\max\{ar(R_i) \mid i \in [n]\}$. Throughout the rest of the paper, we
assume the database schema, the set of constants and the set
of labelled nulls to be fixed. This is why we will suppress
these sets in our notation.

A database instance $I$ is an $n$-tuple $(I_1, \ldots, I_n)$, where
$I_i \subseteq (\Delta \cup \Delta_{null})^{ar(R_i)}$ for every $i \in [n]$. We will
denote $(c_1, \ldots, c_{ar(R_i)}) \in I_i$ by the fact $R_i(c_1, \ldots, c_{ar(R_i)})$ and
therefore represent the instance $I$ as the set of its facts.
Abusing notation, we write $I = \{R_i(t) \mid t \in I_i, i \in [n]\}$.

A position is a position in a predicate, e.g. a three-ary
predicate $R$ has three positions $R^1, R^2, R^3$. We say that a
variable, labelled null or constant $c$ appears e.g. in a position
$R^1$ if there exists a fact $R(c, \ldots)$.

Constraints. Let $\bar{x}$, $\bar{y}$ be tuples of variables. We
consider two types of database constraints, i.e. tuple-generating
and equality-generating dependencies. A tuple-generating
dependency (TGD) is a first-order sentence

$\varphi := \forall \bar{x}(\phi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x}, \bar{y}))$,

such that (a) both $\phi$ and $\psi$ are conjunctions of atomic
formulas (possibly with parameters from $\Delta$), (b) $\psi$ is not empty, (c)
$\phi$ is possibly empty, (d) both $\phi$ and $\psi$ do not contain equality
atoms and (e) all variables from $\bar{x}$ that occur in $\psi$ must also
occur in $\phi$. We denote by $body(\varphi)$ the set of atoms in $\phi$ and
by $head(\varphi)$ the set of atoms in $\psi$.

An equality-generating dependency (EGD) is a first-order
sentence

$\varphi := \forall \bar{x}(\phi(\bar{x}) \rightarrow x_i = x_j)$,

where $x_i$, $x_j$ occur in $\phi$ and $\phi$ is a non-empty conjunction of
equality-free $\mathcal{R}$-atoms (possibly with parameters from $\Delta$).
We denote by $body(\varphi)$ the set of atoms in $\phi$ and by $head(\varphi)$
the set $\{x_i = x_j\}$.

For brevity, we will often omit the $\forall$-quantifier and the
respective list of universally quantified variables.

Constraint satisfaction. Let $\models$ be the standard first-order
model relationship and $\Sigma$ be a set of TGDs and EGDs.
We say that a database instance $I = (I_1, \ldots, I_n)$ satisfies $\Sigma$,
denoted by $I \models \Sigma$, if and only if $(\Delta \cup \Delta_{null}, I_1, \ldots, I_n) \models \Sigma$
in the sense of an $\mathcal{R}$-structure.



It is folklore that TGDs and EGDs together are expressive 
enough to express foreign key constraints, inclusion, func- 
tional, join, multivalued and embedded dependencies. Thus, 
we can capture all important semantic constraints used in 
databases. Therefore, in the rest of the paper, all sets of 
constraints are a union of TGDs and EGDs only. 

Homomorphisms. A homomorphism from a set of
atoms $A_1$ to a set of atoms $A_2$ is a mapping

$\mu : \Delta \cup \Delta_{null} \cup V \rightarrow \Delta \cup \Delta_{null}$

such that the following conditions hold: (a) if $c \in \Delta$, then
$\mu(c) = c$, (b) if $c \in \Delta_{null}$, then $\mu(c) \in \Delta \cup \Delta_{null}$ and (c) if
$R(c_1, \ldots, c_n) \in A_1$, then $R(\mu(c_1), \ldots, \mu(c_n)) \in A_2$.

Chase. Let $\Sigma$ be a set of TGDs and EGDs and $I$ an
instance, represented as a set of atoms. We say that a TGD
$\forall \bar{x} \varphi \in \Sigma$ is applicable to $I$ if there is a homomorphism $\mu$
from $body(\forall \bar{x} \varphi)$ to $I$ and $\mu$ cannot be extended to a
homomorphism $\mu' \supseteq \mu$ from $head(\forall \bar{x} \varphi)$ to $I$. In such a case the
chase step $I \rightarrow J$ is defined as follows. We define a
homomorphism $\nu$ as follows: (a) $\nu$ agrees with $\mu$ on all
universally quantified variables in $\varphi$, (b) for every existentially
quantified variable $y$ in $\forall \bar{x} \varphi$ we choose a "fresh" labelled
null $n_y \in \Delta_{null}$ and define $\nu(y) := n_y$. We set $J$ to be
$I \cup \nu(head(\forall \bar{x} \varphi))$. We say that an EGD $\forall \bar{x} \varphi \in \Sigma$ is
applicable to $I$ if there is a homomorphism $\mu$ from $body(\forall \bar{x} \varphi)$ to
$I$ and $\mu(x_i) \neq \mu(x_j)$. In such a case the chase step $I \rightarrow J$
is defined as follows. We set $J$ to be

• $I$ except that all occurrences of $\mu(x_j)$ are substituted by
$\mu(x_i) =: a$, if $\mu(x_j)$ is a labelled null,

• $I$ except that all occurrences of $\mu(x_i)$ are substituted by
$\mu(x_j) =: a$, if $\mu(x_i)$ is a labelled null,

• undefined, if both $\mu(x_j)$ and $\mu(x_i)$ are constants. In this
case we say that the chase fails.

A chase sequence is an exhaustive application of applicable
constraints

$I_0 \rightarrow I_1 \rightarrow \ldots$,

where we impose no strict order on which constraint must be
applied in case several constraints apply. If this sequence
is finite, say $I_r$ being its final element, the chase terminates
and its result is defined as $I_r$. The length of this chase
sequence is $r$. Note that different orders of application of
applicable constraints may lead to a different chase result.
However, as proven in [28], two different chase orders lead
to homomorphically equivalent results, if these exist.
Therefore, we write $I^\Sigma$ for the result of the chase on an instance
$I$ under constraints $\Sigma$. It has been shown in [24, 3, 16] that
$I^\Sigma \models \Sigma$. In case that a chase step cannot be performed
(e.g., because a homomorphism would have to equate two
constants) the chase result is undefined. In case of an infinite
chase sequence, we also say that the result is undefined.
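The TGD part of the chase step can be sketched as follows. This is a simplified illustration under several assumptions: EGDs are omitted, facts and atoms are tuples `(predicate, term, ...)`, variables are strings starting with `?`, fresh labelled nulls are named `_n0`, `_n1`, and so on, and a hypothetical `max_steps` bound guards against non-terminating chase sequences. None of these names come from the paper.

```python
from itertools import count

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def homomorphisms(atoms, instance, mu=None):
    """Yield every extension of mu that maps all atoms into the instance."""
    mu = dict(mu or {})
    if not atoms:
        yield mu
        return
    (pred, *args), rest = atoms[0], atoms[1:]
    for fact in instance:
        if fact[0] != pred or len(fact) != len(args) + 1:
            continue
        ext = dict(mu)
        # Variables must be bound consistently; constants must match exactly.
        if all(ext.setdefault(a, v) == v if is_var(a) else a == v
               for a, v in zip(args, fact[1:])):
            yield from homomorphisms(rest, instance, ext)

def find_trigger(instance, tgds):
    """Return (head, mu) for the first applicable TGD, or None."""
    for body, head in tgds:
        for mu in homomorphisms(body, instance):
            # Applicable iff mu cannot be extended to map the head into I.
            if next(homomorphisms(head, instance, mu), None) is None:
                return head, mu
    return None

def chase_tgds(instance, tgds, max_steps=100):
    """Apply TGD chase steps until no TGD is applicable."""
    instance, fresh = set(instance), count()
    for _ in range(max_steps):
        trigger = find_trigger(instance, tgds)
        if trigger is None:
            return instance
        head, mu = trigger
        nu = dict(mu)
        for pred, *args in head:
            for t in args:
                if is_var(t) and t not in nu:   # existential variable
                    nu[t] = f"_n{next(fresh)}"  # fresh labelled null
        instance |= {(pred, *(nu.get(t, t) for t in args)) for pred, *args in head}
    raise RuntimeError("no fixpoint reached within max_steps")

# Hypothetical constraints: R(x, y) -> exists z S(y, z), and S(x, y) -> T(x).
tgds = [([("R", "?x", "?y")], [("S", "?y", "?z")]),
        ([("S", "?x", "?y")], [("T", "?x")])]
result = chase_tgds({("R", "a", "b")}, tgds)
print(sorted(result))
```

On this input the first constraint fires once and introduces one labelled null, after which the second constraint adds `T(b)` and the chase terminates.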
Proviso. We will make a simplifying assumption. Let $I$
be a database instance and $\Sigma$ some constraint set. Without
loss of generality we can assume that whenever two labelled
nulls, say $y_1, y_2$, are equated by the chase and $y_1 \in dom(I)$,
then all occurrences of $y_2$ are mapped to $y_1$ in the chase step.
This does not affect chase termination as substituting $y_1$ with
$y_2$ would lead to an isomorphic instance.

Theorem 14. (see [8]) Let $\Sigma$ be a fixed and stratified
set of constraints. Then, there exists a polynomial $Q \in
\mathbb{N}[X]$ such that for any database instance $I$, the length
of every chase sequence is bounded by $Q(\|I\|)$, where
$\|I\|$ is the number of distinct values in $I$. Thus, the
chase terminates in polynomial time data complexity. □



D.2 Previous Results 

In the following we are only interested in constraints for
which any chase sequence is finite. In [28] weak acyclicity
was introduced, which is the starting point for our work.

Definition 17. (see [28]) Given a set of constraints $\Sigma$,
its dependency graph $dep(\Sigma) := (V, E)$ is the directed
graph defined as follows. $V$ is the set of positions that
occur in the TGDs in $\Sigma$. There are two kinds of edges
in $E$. Add them as follows: for every TGD

$\forall \bar{x}(\phi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x}, \bar{y})) \in \Sigma$

and for every $x$ in $\bar{x}$ that occurs in $\psi$ and every
occurrence of $x$ in $\phi$ in position $\pi_1$

• for every occurrence of $x$ in $\psi$ in position $\pi_2$, add an
edge $\pi_1 \rightarrow \pi_2$ (if it does not already exist).

• for every existentially quantified variable $y$ and for
every occurrence of $y$ in a position $\pi_2$, add a special
edge $\pi_1 \xrightarrow{*} \pi_2$ (if it does not already exist).

A set $\Sigma$ of TGDs and EGDs is called weakly acyclic iff
$dep(\Sigma)$ has no cycles through a special edge. □
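The dependency graph and the weak acyclicity test of Definition 17 can be sketched directly. As before, the encoding is an assumption of this illustration: a TGD is a pair `(body, head)` of atom lists, atoms are tuples `(predicate, term, ...)`, variables start with `?`, and a position is written as a pair `(predicate, index)`. The three example constraints are chosen for illustration.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def occurrences(atoms):
    """Map each variable to the set of positions (pred, index) it occurs in."""
    occ = {}
    for pred, *args in atoms:
        for i, t in enumerate(args, start=1):
            if is_var(t):
                occ.setdefault(t, set()).add((pred, i))
    return occ

def dependency_graph(tgds):
    """Return the (normal, special) edge sets of dep(Sigma)."""
    normal, special = set(), set()
    for body, head in tgds:
        b_occ, h_occ = occurrences(body), occurrences(head)
        existential = [v for v in h_occ if v not in b_occ]
        for x, body_positions in b_occ.items():
            if x not in h_occ:          # only variables that occur in the head
                continue
            for p1 in body_positions:
                normal |= {(p1, p2) for p2 in h_occ[x]}
                special |= {(p1, p2) for y in existential for p2 in h_occ[y]}
    return normal, special

def weakly_acyclic(tgds):
    """True iff no cycle of dep(Sigma) passes through a special edge."""
    normal, special = dependency_graph(tgds)
    edges = normal | special

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u not in seen:
                seen.add(u)
                stack.extend(v for (w, v) in edges if w == u)
        return False

    # A special edge (u, v) lies on a cycle iff u is reachable from v.
    return not any(reachable(v, u) for (u, v) in special)

# Illustrative constraints (variable names are arbitrary).
alpha = ([("S", "?x2", "?x3"), ("R", "?x1", "?x2", "?x3")], [("R", "?x2", "?y", "?x1")])
beta = ([("R", "?x1", "?x2", "?x3")], [("S", "?x1", "?x3")])
copy_rule = ([("E", "?x", "?y")], [("F", "?y", "?z")])

print(weakly_acyclic([alpha, beta]))  # special self-loop on position R^2
print(weakly_acyclic([beta]))         # no special edges at all
print(weakly_acyclic([copy_rule]))    # special edges, but no cycle
```

For `alpha`, variable `?x2` occurs in the head and in body position `R^2`, and the existential variable `?y` occurs in head position `R^2`, which yields a special self-loop and hence a violation of weak acyclicity.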

Then, in [8] stratification was set on top of the definition
of weak acyclicity. The main idea is that we can test if a
constraint can cause another constraint to fire, which is the
intuition of the following definition.

Definition 18. (see [8]) Given two TGDs or EGDs $\alpha =
\forall \bar{x}_1 \varphi$, $\beta = \forall \bar{x}_2 \psi$ we define $\alpha \prec \beta$ iff there exist database
instances $I$, $J$ and $\bar{a} \in dom(I)$, $\bar{b} \in dom(J)$ such that

• $I \models \psi(\bar{b})$, possibly $\bar{b}$ is not in $dom(I)$,

• $I \xrightarrow{\alpha, \bar{a}} J$ and

• $J \not\models \psi(\bar{b})$. □

The actual definition of stratification then relies on weak
acyclicity.

Definition 19. (see [8]) The chase graph $G(\Sigma) = (\Sigma, E)$
of a set of TGDs $\Sigma$ contains a directed edge $(\alpha, \beta)$
between two constraints iff $\alpha \prec \beta$. We call $\Sigma$ stratified iff
the set of constraints in every cycle of $G(\Sigma)$ is weakly
acyclic. □

Theorem 13. (see [8]) If a set of constraints is weakly
acyclic, then it is also stratified. It can be decided by
a coNP-algorithm whether a set of constraints is
stratified. □

The crucial property of stratification is that it guarantees 
the termination of the chase in polynomially many chase 
steps. 



D.3 Proofs of the Technical Results

Proof of Theorem 7

• Follows directly from the definition of the propagation
graph. In the propagation graph stronger conditions have
to be satisfied than in the dependency graph in order to
add special or non-special edges.

• Let $\alpha := S(X_2, X_3), R(X_1, X_2, X_3) \rightarrow \exists Y\, R(X_2, Y, X_1)$
and $\beta := R(X_1, X_2, X_3) \rightarrow S(X_1, X_3)$. It can be seen
that $\alpha \prec \beta$ and $\beta \prec \alpha$. Together with the fact that
$\{\alpha, \beta\}$ is not weakly acyclic it follows that $\{\alpha, \beta\}$ is not
stratified. However, $\{\alpha, \beta\}$ is safe.

• (see [8]) Let $\gamma := T(X_1, X_2), T(X_2, X_1) \rightarrow \exists Y_1, Y_2\,
T(X_1, Y_1), T(Y_1, Y_2), T(Y_2, X_1)$. It was argued in [8]
that $\{\gamma\}$ is stratified. However, it is not safe because
both $T^1$ and $T^2$ are affected and therefore $dep(\{\gamma\}) =
prop(\{\gamma\})$ and it was argued in [8] that it is not weakly
acyclic. □

Proof of Theorem 8

First we introduce some additional notation. We denote
constraints in the form $\phi(\bar{x}_1, \bar{x}_2, \bar{u}) \rightarrow \exists \bar{y}\, \psi(\bar{x}_1, \bar{x}_2, \bar{y})$, where
$\bar{x}_1, \bar{x}_2, \bar{u}$ are all the universally quantified variables and

• $\bar{u}$ are those variables that do not occur in the head,

• every element in $\bar{x}_1$ occurs in a non-affected position in
the body, and

• every element in $\bar{x}_2$ occurs only in affected positions in
the body.

The proof is inspired by the proof of Theorem 3.8 in [28];
in particular, the notation and some introductory definitions are
taken from there. In a first step we will give the proof for
TGDs only, i.e. we do not consider EGDs. Later, we will see
what changes when we add EGDs again.

Note that $\Sigma$ is fixed. Let $(V, E)$ be the propagation graph
$prop(\Sigma)$. For every position $\pi \in V$ an incoming path is a,
possibly infinite, path ending in $\pi$. We denote by $rank(\pi)$ the
maximum number of special edges over all incoming paths.
It holds that $rank(\pi) < \infty$ because $prop(\Sigma)$ contains no
cycles through a special edge. Define $r := \max\{rank(\pi) \mid
\pi \in V\}$ and $p := |V|$. It is easily verified that $r \leq p$, thus
$r$ is bounded by a constant. This allows us to partition the
positions into sets $N_0, \ldots, N_p$ such that $N_i$ contains exactly
those positions $\pi$ with $rank(\pi) = i$. Let $n$ be the number of
values in $I$. We define $dom(\Sigma)$ as the set of constants in $\Sigma$.
Choose some $\alpha := \phi(\bar{x}_1, \bar{x}_2, \bar{u}) \rightarrow \exists \bar{y}\, \psi(\bar{x}_1, \bar{x}_2, \bar{y}) \in \Sigma$.

Let $I \rightarrow \ldots \rightarrow G \xrightarrow{\alpha,\, \bar{a}_1 \bar{a}_2 \bar{b}} G'$ and let $\bar{c}$ be the newly created
null values in the step from $G$ to $G'$. Then

1. newly introduced labelled nulls occur only in affected
positions,

2. $\bar{a}_1 \subseteq dom(I) \cup dom(\Sigma)$ and

3. for every labelled null $Y \in \bar{a}_2$ that occurs in $\pi$ in $\phi$
and every $c \in \bar{c}$ that occurs in $\rho$ in $\psi$ it holds that
$rank(\pi) < rank(\rho)$.

This intermediate claim is easily proved by induction on
the length of the chase sequence. Now we show by induction
on $i$ that the number of values that can occur in any position
in $N_i$ in $G'$ is bounded by some polynomial $Q_i$ in $n$ that
depends only on $i$ (and, of course, $\Sigma$). As $i \leq r \leq p$, this
implies the theorem's statement because the maximal arity
$ar(\mathcal{R})$ of a relation is fixed. We denote by $|body(\Sigma)|$ the
number of characters of the largest body of all constraints in $\Sigma$.

Case 1: $i = 0$. We claim that $Q_0(n) := n + |\Sigma| \cdot n^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$
is sufficient for our needs. We consider a position $\pi \in N_0$
and an arbitrary TGD from $\Sigma$ such that $\pi$ occurs in its head.
For simplicity we assume that it has the syntactic form of
$\alpha$. In case that there is a universally quantified variable in $\pi$,
there can occur at most $n$ distinct elements in $\pi$. Therefore,
we assume that some existentially quantified variable occurs
in $\pi$ in $\psi$. Note that as $i = 0$ it must hold that $|\bar{x}_2| = 0$. Every
value in $I$ can occur in $\pi$. But how many labelled nulls can
be newly created in $\pi$? For every choice of $\bar{a}_1, \bar{b} \subseteq dom(G)$
such that $G \models \phi(\bar{a}_1, \lambda, \bar{b})$ and $G \not\models \exists \bar{y}\, \psi(\bar{a}_1, \lambda, \bar{y})$ at most
one labelled null can be added to $\pi$ by $\alpha$. Note that in this
case it holds that $\bar{a}_1 \subseteq dom(I)$ due to (1). So, there are at
most $n^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$ such choices. Over all TGDs at most
$|\Sigma| \cdot n^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$ labelled nulls are created in $\pi$.

Case 2: $i \rightarrow i + 1$. We claim that $Q_{i+1}(n) := \sum_{j=0}^{i} Q_j(n) +
|\Sigma| \cdot (\sum_{j=0}^{i} Q_j(n))^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$ is such a polynomial.
Consider the fixed TGD $\alpha$. Let $\pi \in N_{i+1}$. Values in $\pi$ may
be either copied from a position in $N_0 \cup \ldots \cup N_i$ or may
be a new labelled null. Therefore w.l.o.g. we assume
that some existentially quantified variable occurs in $\pi$ in
$\psi$. In case a TGD, say $\alpha$, is violated in $G'$ there must exist
$\bar{a}_1, \bar{a}_2 \subseteq dom_{G'}(N_0, \ldots, N_i)$ and $\bar{b} \subseteq dom(G')$ such that
$G' \models \phi(\bar{a}_1, \bar{a}_2, \bar{b})$, but $G' \not\models \exists \bar{y}\, \psi(\bar{a}_1, \bar{a}_2, \bar{y})$. If a newly
introduced labelled null occurs in $\bar{a}_2$, say in some position $\rho$, then
$\rho \in \bigcup_{j=0}^{i} N_j$. As there are at most $(\sum_{j=0}^{i} Q_j(n))^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$
many such choices for $\bar{a}_1, \bar{a}_2$, at most $(\sum_{j=0}^{i} Q_j(n))^{ar(\mathcal{R}) \cdot |body(\Sigma)|}$
many labelled nulls can be newly created in $\pi$.

When we allow EGDs among our constraints, we have that
the number of values that can occur in any position in $N_i$ in $G'$
can be bounded by the same polynomial $Q_i$ because equating
labelled nulls does not increase the number of labelled nulls
and because EGDs preserve valid existential conclusions
of TGDs. □



Proof of Theorems 9 and 10

Theorem 9 follows from Theorem 12. Before we prove
Theorem 10 we introduce an additional tool.

In general, a set of constraints may have several restriction
systems. A restriction system is minimal if it is obtained
from $((\Sigma, \emptyset), \{(\alpha, \emptyset) \mid \alpha \in \Sigma\})$ by a repeated application
of the constraints from bullets one to three in Definition 13
(until all constraints hold) s.t., in case of the first and second
bullet, the image of $f(\beta)$ is extended only by those positions
that are required to satisfy the condition. Thus, a minimal
restriction system can be computed by a fixedpoint iteration.

Lemma 10. Let $\Sigma$ be a set of constraints, $(G'(\Sigma), f)$ a
restriction system for $\Sigma$ and $(G'_{min}(\Sigma), f_{min})$ its
minimal one.

• Let $P$ be a set of positions and $\alpha, \beta$ constraints.
Then, the mapping $(P, \alpha, \beta) \mapsto \alpha \prec_P \beta?$ can be
computed by an NP-algorithm.

• The minimal restriction system for $\Sigma$ is unique. It
can be computed from $\Sigma$ in non-deterministic
polynomial time.

• It holds that $\Sigma$ is safely restricted if and only if every
strongly connected component in $G'_{min}(\Sigma)$ is safe. □

Proof. The proof of part one of the lemma proceeds like 
the proof of Theorem 3 in |8|. It is enough to consider 
candidate databases for A of size at most \a\ -f |/3|, i.e. unions 
of homomorphic images of the premises of a and (3 s.t. null 
values occur only in positions from P. This concludes part 
one. 

Uniqueness holds by definition. The minimal restriction
system can be computed via successive application of the
constraints in Definition 13 (note that $f$ and $E$ are changed
in each step) by a Turing machine that guesses answers to
questions of the form $\alpha \prec_P \beta$? As the mapping
$(P, \alpha, \beta) \mapsto \alpha \prec_P \beta$? can be computed by an
NP-algorithm and the fixedpoint is reached after polynomially
many applications of the constraints from Definition 13, this
implies the second claim.

Concerning the third claim, observe that every strongly
connected component in $G'_{\min}(\Sigma)$ is contained in a single
strongly connected component of any other restriction
system. This implies the third claim. □

Now we turn to the proof of Theorem 10. By the
previous lemma, it suffices to check the conditions from
Definition 14 only for the minimal restriction system. To decide
whether $\Sigma$ is not safely restricted, compute the minimal
restriction system, guess a strongly connected component, and
check that it is not safe. Clearly, this can be done in
non-deterministic polynomial time. □

Proof of Theorem 11 (Sketch)

Before proving this theorem, we need a technical lemma. It 
states the most important property of restriction systems. 

Lemma 11. Let $\alpha \in \Sigma$, let $(G'(\Sigma), f)$ be a restriction
system for $\Sigma$, and let $I$ be a database instance. If during the
chase it occurs that $J_1 \xrightarrow{\alpha, \bar{a}} J_2$, then the set of positions
in which null values from $\bar{a}$ that are not in $dom(I)$ occur
in the body of $\alpha$ is contained in $f(\alpha)$. □

The proof of this lemma is by induction on the length of
the chase sequence with which $J_1$ was obtained from $I$ and is
straightforward. Note that it uses the simplifying assumption
that we introduced at the end of Appendix D.1.

Let us now turn to the proof of Theorem 11. Let
$(G'(\Sigma), f) = ((\Sigma, E), f)$ be the minimal restriction system of $\Sigma$.
Let $C_1, \ldots, C_m$ be all the pairwise different strongly connected
components of the reflexive closure of $G'(\Sigma)$. The graph $H$
is defined as the quotient graph of $G'(\Sigma)$ with respect to the
relation that identifies constraints lying in the same component,
i.e. $H := G'(\Sigma) / \{ (\alpha, \beta) \mid \exists i \in [m] : \alpha, \beta \in C_i \}$.
$H$ is acyclic and depends only on $\Sigma$. We show the claim by
induction on the number of nodes $n$ in $H$.
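
To make the quotient construction concrete, the following sketch computes the strongly connected components of a small directed graph, contracts each component to one node, and lists the nodes without successors (the kind of node picked in the inductive step). It is an illustration under assumptions, not the authors' procedure: the constraint labels `c1` to `c4` and the example edges are made up, and the function names are hypothetical. Kosaraju's algorithm is used here simply because it is short; the recursive form is only suitable for small graphs.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; returns a dict mapping node -> component index."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    order, seen = [], set()

    def visit(u):
        # First pass: record nodes in order of DFS completion.
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in nodes:
        if u not in seen:
            visit(u)

    comp = {}

    def assign(u, idx):
        # Second pass: walk the reversed graph to collect one component.
        comp[u] = idx
        for v in pred[u]:
            if v not in comp:
                assign(v, idx)

    idx = 0
    for u in reversed(order):
        if u not in comp:
            assign(u, idx)
            idx += 1
    return comp


def quotient_graph(nodes, edges):
    """Contract every strongly connected component to a single node."""
    comp = strongly_connected_components(nodes, edges)
    h_nodes = set(comp.values())
    h_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, h_nodes, h_edges


def nodes_without_successors(h_nodes, h_edges):
    return h_nodes - {u for u, _ in h_edges}


# Hypothetical example: two 2-cycles joined by one edge.
nodes = ["c1", "c2", "c3", "c4"]
edges = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c4", "c3")]
comp, h_nodes, h_edges = quotient_graph(nodes, edges)
sinks = nodes_without_successors(h_nodes, h_edges)
```

Since edges inside a component are dropped, the contracted graph has no cycles, so a node without successors always exists when the graph is non-empty.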

Case $n = 1$: $G'(\Sigma)$ has a single strongly connected
component. This component is safe by prerequisite. It
follows from Theorem 8 that the chase terminates in polynomial
time data complexity in this case.

Case $n \to n + 1$: Let $h$ be a node in $H$ that has no
successors and $H_-$ the union of constraints from all other
nodes in $H$. The chase with $H_-$ terminates by induction
hypothesis; say that the number of distinct values in this result
is bounded by some polynomial $Q_-$. Chasing the constraints
in $h$ alone terminates, too; say that the number of distinct
values in this result is bounded by the polynomial $Q$. The
firing of constraints from $H_-$ can cause some constraints
from $h$ to copy null values in their heads. Yet, the firing
of constraints in $h$ cannot enforce constraints from $H_-$ to
copy null values to their head (by construction of the minimal
restriction system). If $I$ is the database instance to be chased,
then the number of distinct values in the result is bounded
by $Q_-(\|I\|) + Q(\|I\| + Q_-(\|I\|))$. As $\Sigma$ is fixed we can
conclude that the chase terminates in polynomial time data
complexity. □
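
As a hedged illustration of why a bound of the form $Q_-(\|I\|) + Q(\|I\| + Q_-(\|I\|))$ stays polynomial, take the hypothetical choices $Q_-(n) = n^2$ and $Q(n) = n^3$ (these concrete polynomials are assumptions made for the example and do not come from the proof):

```latex
\begin{align*}
Q_-(\|I\|) + Q\bigl(\|I\| + Q_-(\|I\|)\bigr)
  &= \|I\|^2 + \bigl(\|I\| + \|I\|^2\bigr)^3 \\
  &\le \|I\|^2 + \bigl(2\,\|I\|^2\bigr)^3 \qquad (\|I\| \ge 1) \\
  &= \|I\|^2 + 8\,\|I\|^6 .
\end{align*}
```

The same argument works for any two polynomials, since sums and compositions of polynomials are again polynomials.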



is not safe. Therefore, it is also not safely stratified. 
The minimal restriction system is $((\Sigma, E), f)$, where
$E = \ldots$ and $f = \{ (\gamma, \emptyset) \mid \gamma \in \Sigma \}$. Obviously,
every cycle in $(\Sigma, E)$ is safe. Hence, $\Sigma$ is safely restricted.

□ 

Proof of Proposition 3

Let $(G'(\Sigma), f)$ be a restriction system for $\Sigma$ such that every
strongly connected component in $G'(\Sigma)$ is safely stratified.
Choose some strongly connected component $C$ and two
constraints $\alpha, \beta \in C$ such that $\alpha \prec_P \beta$ for some set of positions
$P$. By Proposition 6, $\alpha \prec \beta$ holds. As $C$ is safely stratified,
this means that $C$ must also be safe. So, every cycle in $G'(\Sigma)$
is also safe. □

Proof of Proposition 4

• Let $\Sigma := \{ \forall x_1, x_2\, (T(e, x_1, x_2), T(x_2, d, d) \to T(x_1, x_2, \ldots)) \}$,
where $d$ is a constant and the $x_i$ are variables. Then, $\Sigma' = \emptyset$.

• An example of such a set of constraints $C$ is the
following.

$T(x_1, d, x_2) \to \exists y\, T(g, e, y), T(f, d, y)$
$T(x_1, e, x_2) \to \exists y\, T(g, e, y), T(f, d, y)$
$T(x_1, d, x_2) \to T(x_2, e, x_1)$
$T(x_1, e, x_2) \to \exists y\, T(x_2, d, y)$

Note that $d, e, f, g$ are constants. □



Proof of Theorem nil 

• Let $\Sigma$ be weakly acyclic. Then $\Sigma$ is safe, because weak
acyclicity implies safety, so every cycle in $G'(\Sigma)$ is safe.
Let $\Sigma$ be safe. Then every cycle in $G'(\Sigma)$ is safe, because $\Sigma$ is.

• Follows from Example 5 and the following proposition.

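Proposition 6. Let $P \subseteq P' \subseteq pos(\Sigma)$. If $\alpha \prec_P \beta$,
then $\alpha \prec_{P'} \beta$. It holds that if $\alpha \prec_P \beta$, then $\alpha \prec \beta$.

□

The proof follows from the definitions of $\prec_P$ and $\prec$.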

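• Consider the following TGDs. $\Sigma := \{\alpha, \beta, \chi, \delta\}$ with

$\alpha := R_1(x_1, x_2) \to \exists y\, S(x_1, x_2, y)$,

$\beta := R_1(x_1, x_2) \to \exists y\, T(x_1, x_2, y)$,

$\chi := S(x_1, x_2, x_3), T(x_4, x_5, x_6) \to T(x_5, x_1, x_4)$ and

$\delta := S(x_1, x_2, x_3), T(x_1, x_5, x_3) \to T(x_1, x_3, x_3), R_1(x_3, x_1), R_2(x_3, x_1)$.

It can be seen that $\alpha \prec \chi$, $\beta \prec \chi$, $\chi \prec \delta$, $\delta \prec \alpha$ and
$\delta \prec \beta$ hold. Thus, there is a cycle in the chase graph that
involves all constraints. Unfortunately, the constraint set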